Group 4 Project Assignment: NYC Flights

Introduction

In recent years, flight delays have cost the airline industry millions of dollars and have become a recurring problem. Therefore, it is essential to understand the behavior of flight delays.

This report’s objective is to analyse the NYC Flights data set and offer suggestions for reducing departure delays. Time, weather, season, carrier, aircraft, and many other factors can contribute to a departure delay. To understand this, we have examined the relationship between departure delay and a range of other characteristics in the data set.

The NYC Flights data sets provided include flight information for three New York City airports: John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR), and LaGuardia Airport (LGA). They also contain data on the weather, airports, airlines, and planes. The main goal is to reduce departure delays.

1. Description of the problem

As a team, we are working for a business analytics consulting company. The Port Authority of New York and New Jersey (PANYNJ) approached our company and requested that we analyze historical data to understand general trends in flight patterns and airport performance in NYC and examine different issues related to departure delays. The sections below detail all tasks associated with our analysis.

2. Assumptions

1. All three airports in question [EWR, JFK, LGA] follow similar air traffic regulations, security policies, boarding policies and baggage handling systems.

2. We are assuming that the time required for an individual flight to take off from or touch down on the runway is the same at all three airports.

3. We are only taking into account the flights that have a positive departure delay. Those with negative values (early departures) are not considered.

4. We are assuming that flight cancellations result only from severe weather conditions and irreparable operational or technical issues in the aircraft.

5. When calculating the average departure delay and the number of delayed flights, we exclude the largest 1% of observations (those at or above the 99th percentile) in order to filter out extremely high values.
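Assumptions 3 and 5 together describe a simple filtering rule. A minimal sketch of one way to apply it is shown below; the delay values are hypothetical toy numbers rather than rows from the data set, and the variable names cutoff and late_trimmed are introduced only for illustration.

```r
# Hypothetical toy departure delays in minutes (not taken from flights.csv).
dep_delay <- c(-5, -2, 0, 3, 8, 15, 30, 45, 60, 1200)

# 99th percentile of all observations, used as the upper cutoff.
cutoff <- quantile(dep_delay, 0.99)

# Keep only late departures (delay > 0) that fall below the cutoff.
late_trimmed <- dep_delay[dep_delay > 0 & dep_delay < cutoff]

mean(late_trimmed)    # average delay after trimming
length(late_trimmed)  # number of delayed flights after trimming
```

In this toy vector the single extreme value of 1200 minutes is dropped, while the other late departures are kept.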

3. Potential data issues in the data set

1. Bias in analysis could occur due to the lack of flight data from 12 am to 5 am.

2. The data set does not contain certain significant factors that adequately explain flight delays, particularly delays caused by air traffic congestion, boarding/airline problems, etc.

3. Collinearity exists between different parameters in the data sets provided, particularly among the weather variables, which could lead to inaccurate insights from the analysis.

4. The insights drawn from the analysis may not reflect the present-day conditions that contribute to departure delays at the NYC airports, because the data set covers 2013 only.

5. Several characteristics in the different data sets have skewed distributions, so the standard deviation could be highly inflated in those cases, making it a poor measure of variability for those characteristics.

6. The exploratory analysis does not address the underlying cause of departure delays; instead, it concentrates on visualizing the key characteristics of the data sets.

7. The data set does not provide adequate information regarding the distinct categories of flights, such as commercial passenger flights, freight aircraft or private jets.
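The concern in item 5 can be illustrated with a small sketch. The values below are hypothetical and deliberately right-skewed; the comparison uses only base R functions (sd, IQR, mad) and is not part of the original analysis.

```r
# Hypothetical right-skewed sample: most values are small, one is extreme.
x <- c(1, 2, 2, 3, 3, 4, 5, 300)

sd(x)    # standard deviation, pulled upward by the single extreme value
IQR(x)   # interquartile range, based on the middle 50% of the values
mad(x)   # median absolute deviation, another robust measure of spread

# The same comparison applies to location: the mean is pulled toward the tail.
mean(x)
median(x)
```

For skewed variables such as dep_delay, robust summaries like the median and the interquartile range may therefore describe typical behaviour better than the mean and the standard deviation.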

4. Objectives

This report aims to answer the following questions:

Question 1 - Is there a pattern to the departure delay in terms of time? (Month, Day of week and Hour)

Question 2 - How does weather impact flights from NYC? What is the effect of weather on departure delay?

Question 3 - How do departure delays differ by airport and carrier? Which airport and carrier are the best and the worst?

Question 4 - What is the impact of plane manufacturer and structure of the aircraft on departure delays?

Question 5 - Is there a geographic pattern to the departure delay?

Exploratory Data Analysis

1. Setting up the environment and loading the libraries

library(tidyverse)
library(dplyr)
library(xray)
library(ggplot2)
library(lubridate)
library(corrgram)
library(corrplot)

2. Reading the data sets

flights <- read.csv(file = "flights.csv")
airlines <- read.csv(file = "airlines.csv")
planes <- read.csv(file = "planes.csv")
airports <- read.csv(file = "airports.csv")
weather <- read.csv(file = "weather.csv")

Data set overview

Figure: Data set overview. Source: http://bigdatasummerinst.sph.umich.edu/wiki2019/images/6/63/Bdsi_2019_r_practice_dplyr_nycflights_answers.pdf

The data we work with consists of five CSV files that contain the following variables:

airlines.csv - Airline carrier code and carrier full names

airports.csv - Airport metadata, with the following variables:

faa - FAA airport code
name - usual name of the airport
lat, lon - location of airport as latitude, longitude
alt - altitude (in feet)
tz - time zone offset from GMT
dst - daylight saving time zone
tzone - IANA time zone, as determined by the GeoNames web service

flights.csv - On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013

year, month, day - date of departure
dep_time, arr_time - actual departure and arrival times (format HHMM or HMM), local time zone
sched_dep_time, sched_arr_time - scheduled departure and arrival times (format HHMM or HMM), local time zone
dep_delay, arr_delay - Departure and arrival delays, in minutes. Negative times represent early departures/arrivals
carrier - two-letter carrier abbreviation.
flight - flight number
tailnum - plane tail number
origin, dest - origin and destination
air_time - amount of time spent in the air, in minutes.
distance - distance between airports, in miles.
hour, minute - time of scheduled departure broken into hour and minutes.
time_hour - scheduled date and hour of the flight as a date.
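Because the clock times are stored as integers in HHMM or HMM form, they cannot be subtracted directly as if they were minutes. A minimal sketch of one way to convert them to minutes since midnight follows; the helper name hhmm_to_minutes is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: convert HHMM/HMM integers to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Illustrative clock values: an actual and a scheduled departure time.
dep_time <- 517        # 5:17
sched_dep_time <- 515  # 5:15

# Difference in minutes once both values are on the same scale.
hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)  # 2
```

This simple version does not handle flights that are scheduled before midnight and depart after it; the dep_delay column already accounts for such cases and is the value used in the rest of the report.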

planes.csv - Plane metadata for all plane tail numbers found in the FAA aircraft registry.

tailnum - Tail number
year - Year manufactured.
type - Type of plane.
manufacturer, model - Manufacturer and model.
engines, seats - Number of engines and seats.
speed - Average cruising speed in mph.
engine - Type of engine.

weather.csv - Hourly meteorological data for LGA, JFK and EWR

origin - Weather station location
year, month, day, hour - Time of recording.
temp, dewp - Temperature and dew point in degrees Fahrenheit.
humid - Relative humidity.
wind_dir, wind_speed, wind_gust - Wind direction (in degrees), speed and gust speed (in mph).
precip - Precipitation, in inches.
pressure - Sea level pressure in millibars.
visib - Visibility in miles.
time_hour - Date and hour of the weather station recording as a POSIXct date.
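The five files share key columns, which is what allows them to be combined. The sketch below shows one possible way to link them with dplyr joins. It uses small hypothetical toy tables (only the column names follow the descriptions above) and assumes that carrier identifies an airline and that origin together with time_hour identifies a weather record.

```r
library(dplyr)

# Hypothetical toy tables; values are illustrative, column names follow the text.
flights_toy <- data.frame(
  carrier   = c("AA", "9E"),
  origin    = c("JFK", "EWR"),
  time_hour = c("2013-01-01T10:00:00Z", "2013-01-01T11:00:00Z"),
  dep_delay = c(2, 15)
)
airlines_toy <- data.frame(
  carrier = c("9E", "AA"),
  name    = c("Endeavor Air Inc.", "American Airlines Inc.")
)
weather_toy <- data.frame(
  origin    = c("JFK", "EWR"),
  time_hour = c("2013-01-01T10:00:00Z", "2013-01-01T11:00:00Z"),
  visib     = c(10, 7)
)

# Attach the carrier name and the hourly weather to each flight.
joined <- flights_toy %>%
  left_join(airlines_toy, by = "carrier") %>%
  left_join(weather_toy, by = c("origin", "time_hour"))

joined
```

A left join keeps every flight even when no matching airline or weather row exists, in which case the added columns are filled with NA.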

3. Data skimming

glimpse(flights)
## Rows: 327,346
## Columns: 20
## $ ID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <int> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <int> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-0…
summary(flights)
##        ID              year          month             day       
##  Min.   :     1   Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 81837   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :163674   Median :2013   Median : 7.000   Median :16.00  
##  Mean   :163674   Mean   :2013   Mean   : 6.565   Mean   :15.74  
##  3rd Qu.:245510   3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :327346   Max.   :2013   Max.   :12.000   Max.   :31.00  
##     dep_time    sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   :   1   Min.   : 500   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 907   1st Qu.: 905   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1122  
##  Median :1400   Median :1355   Median :  -2.00   Median :1535   Median :1554  
##  Mean   :1349   Mean   :1340   Mean   :  12.56   Mean   :1502   Mean   :1533  
##  3rd Qu.:1744   3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1944  
##  Max.   :2400   Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:327346      Min.   :   1   Length:327346     
##  1st Qu.: -17.000   Class :character   1st Qu.: 544   Class :character  
##  Median :  -5.000   Mode  :character   Median :1467   Mode  :character  
##  Mean   :   6.895                      Mean   :1943                     
##  3rd Qu.:  14.000                      3rd Qu.:3412                     
##  Max.   :1272.000                      Max.   :8500                     
##     origin              dest              air_time        distance   
##  Length:327346      Length:327346      Min.   : 20.0   Min.   :  80  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 509  
##  Mode  :character   Mode  :character   Median :129.0   Median : 888  
##                                        Mean   :150.7   Mean   :1048  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##       hour           minute       time_hour        
##  Min.   : 5.00   Min.   : 0.00   Length:327346     
##  1st Qu.: 9.00   1st Qu.: 8.00   Class :character  
##  Median :13.00   Median :29.00   Mode  :character  
##  Mean   :13.14   Mean   :26.23                     
##  3rd Qu.:17.00   3rd Qu.:44.00                     
##  Max.   :23.00   Max.   :59.00
anomalies(flights)
## $variables
##          Variable      q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1          minute 327346   0   - 58924   18%      0      -    0    -        60
## 2       dep_delay 327346   0   - 16466 5.03%      0      -    0    -       526
## 3       arr_delay 327346   0   -  5409 1.65%      0      -    0    -       577
## 4            year 327346   0   -     0     -      0      -    0    -         1
## 5          origin 327346   0   -     0     -      0      -    0    -         3
## 6           month 327346   0   -     0     -      0      -    0    -        12
## 7         carrier 327346   0   -     0     -      0      -    0    -        16
## 8            hour 327346   0   -     0     -      0      -    0    -        19
## 9             day 327346   0   -     0     -      0      -    0    -        31
## 10           dest 327346   0   -     0     -      0      -    0    -       104
## 11       distance 327346   0   -     0     -      0      -    0    -       213
## 12       air_time 327346   0   -     0     -      0      -    0    -       509
## 13 sched_dep_time 327346   0   -     0     -      0      -    0    -      1020
## 14 sched_arr_time 327346   0   -     0     -      0      -    0    -      1162
## 15       dep_time 327346   0   -     0     -      0      -    0    -      1317
## 16       arr_time 327346   0   -     0     -      0      -    0    -      1410
## 17         flight 327346   0   -     0     -      0      -    0    -      3835
## 18        tailnum 327346   0   -     0     -      0      -    0    -      4037
## 19      time_hour 327346   0   -     0     -      0      -    0    -      6922
## 20             ID 327346   0   -     0     -      0      -    0    -    327346
##         type anomalous_percent
## 1    Integer               18%
## 2    Integer             5.03%
## 3    Integer             1.65%
## 4    Integer                 -
## 5  Character                 -
## 6    Integer                 -
## 7  Character                 -
## 8    Integer                 -
## 9    Integer                 -
## 10 Character                 -
## 11   Integer                 -
## 12   Integer                 -
## 13   Integer                 -
## 14   Integer                 -
## 15   Integer                 -
## 16   Integer                 -
## 17   Integer                 -
## 18 Character                 -
## 19 Character                 -
## 20   Integer                 -
## 
## $problem_variables
##   Variable      q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct    type
## 1     year 327346   0   -     0     -      0      -    0    -         1 Integer
##   anomalous_percent                     problems
## 1                 - Less than 2 distinct values.
distributions(flights)
## ================================================================================

##          Variable     p_1    p_10     p_25     p_50      p_75     p_90
## 1          minute       0       0        8       29        44       55
## 2       dep_delay     -12      -7       -5       -2        11       49
## 3       arr_delay     -44     -26      -17       -5        14       52
## 4            year    2013    2013     2013     2013      2013     2013
## 5           month       1       2        4        7        10       11
## 6            hour       6       7        9       13        17       19
## 7             day       1       4        8       16        23       28
## 8        distance     173     214      509      888      1389     2446
## 9        air_time      33      47       82      129       192      319
## 10 sched_dep_time     600     705      905     1355      1729     1944
## 11 sched_arr_time      38     916     1122     1554      1944     2200
## 12       dep_time     551     703      907     1400      1744     2008
## 13       arr_time      22     853     1104     1535      1940     2158
## 14         flight      11     207      544     1467      3412     4438
## 15             ID 3274.45 32735.5 81837.25 163673.5 245509.75 294611.5
##         p_99
## 1         59
## 2        191
## 3        190
## 4       2013
## 5         12
## 6         22
## 7         31
## 8       2586
## 9        364
## 10      2225
## 11      2353
## 12      2251
## 13      2344
## 14      5736
## 15 324072.55
glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW…
## $ year       <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, …
## $ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.…
## $ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.…
## $ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.…
## $ wind_dir   <int> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,…
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, …
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.…
## $ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101…
## $ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ time_hour  <chr> "2013-01-01T06:00:00Z", "2013-01-01T07:00:00Z", "2013-01-01…
summary(weather)
##     origin               year          month             day       
##  Length:26115       Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :2013   Median : 7.000   Median :16.00  
##                     Mean   :2013   Mean   : 6.504   Mean   :15.68  
##                     3rd Qu.:2013   3rd Qu.: 9.000   3rd Qu.:23.00  
##                     Max.   :2013   Max.   :12.000   Max.   :31.00  
##                                                                    
##       hour            temp             dewp           humid       
##  Min.   : 0.00   Min.   : 10.94   Min.   :-9.94   Min.   : 12.74  
##  1st Qu.: 6.00   1st Qu.: 39.92   1st Qu.:26.06   1st Qu.: 47.05  
##  Median :11.00   Median : 55.40   Median :42.08   Median : 61.79  
##  Mean   :11.49   Mean   : 55.26   Mean   :41.44   Mean   : 62.53  
##  3rd Qu.:17.00   3rd Qu.: 69.98   3rd Qu.:57.92   3rd Qu.: 78.79  
##  Max.   :23.00   Max.   :100.04   Max.   :78.08   Max.   :100.00  
##                  NA's   :1        NA's   :1       NA's   :1       
##     wind_dir       wind_speed         wind_gust         precip        
##  Min.   :  0.0   Min.   :   0.000   Min.   :16.11   Min.   :0.000000  
##  1st Qu.:120.0   1st Qu.:   6.905   1st Qu.:20.71   1st Qu.:0.000000  
##  Median :220.0   Median :  10.357   Median :24.17   Median :0.000000  
##  Mean   :199.8   Mean   :  10.518   Mean   :25.49   Mean   :0.004469  
##  3rd Qu.:290.0   3rd Qu.:  13.809   3rd Qu.:28.77   3rd Qu.:0.000000  
##  Max.   :360.0   Max.   :1048.361   Max.   :66.75   Max.   :1.210000  
##  NA's   :460     NA's   :4          NA's   :20778                     
##     pressure          visib         time_hour        
##  Min.   : 983.8   Min.   : 0.000   Length:26115      
##  1st Qu.:1012.9   1st Qu.:10.000   Class :character  
##  Median :1017.6   Median :10.000   Mode  :character  
##  Mean   :1017.9   Mean   : 9.255                     
##  3rd Qu.:1023.0   3rd Qu.:10.000                     
##  Max.   :1042.1   Max.   :10.000                     
##  NA's   :2729
anomalies(weather)
## $variables
##      Variable     q   qNA    pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1      precip 26115     0      - 24366 93.3%      0      -    0    -        59
## 2   wind_gust 26115 20778 79.56%     0     -      0      -    0    -        38
## 3    pressure 26115  2729 10.45%     0     -      0      -    0    -       469
## 4    wind_dir 26115   460  1.76%  1256 4.81%      0      -    0    -        38
## 5  wind_speed 26115     4  0.02%  1256 4.81%      0      -    0    -        37
## 6        hour 26115     0      -  1075 4.12%      0      -    0    -        24
## 7       visib 26115     0      -    10 0.04%      0      -    0    -        20
## 8        dewp 26115     1     0%     0     -      0      -    0    -       154
## 9        temp 26115     1     0%     0     -      0      -    0    -       174
## 10      humid 26115     1     0%     0     -      0      -    0    -      2500
## 11       year 26115     0      -     0     -      0      -    0    -         1
## 12     origin 26115     0      -     0     -      0      -    0    -         3
## 13      month 26115     0      -     0     -      0      -    0    -        12
## 14        day 26115     0      -     0     -      0      -    0    -        31
## 15  time_hour 26115     0      -     0     -      0      -    0    -      8714
##         type anomalous_percent
## 1    Numeric             93.3%
## 2    Numeric            79.56%
## 3    Numeric            10.45%
## 4    Integer             6.57%
## 5    Numeric             4.82%
## 6    Integer             4.12%
## 7    Numeric             0.04%
## 8    Numeric                0%
## 9    Numeric                0%
## 10   Numeric                0%
## 11   Integer                 -
## 12 Character                 -
## 13   Integer                 -
## 14   Integer                 -
## 15 Character                 -
## 
## $problem_variables
##   Variable     q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct    type
## 1   precip 26115   0   - 24366 93.3%      0      -    0    -        59 Numeric
## 2     year 26115   0   -     0     -      0      -    0    -         1 Integer
##   anomalous_percent                                problems
## 1             93.3% Anomalies present in 93.3% of the rows.
## 2                 -            Less than 2 distinct values.
distributions(weather)
## ================================================================================

##      Variable     p_1    p_10   p_25    p_50    p_75    p_90     p_99
## 1      precip       0       0      0       0       0       0     0.13
## 2   wind_gust 16.1109 18.4125 20.714 24.1664 28.7695 33.3726  43.7296
## 3    pressure  1001.3  1008.5 1012.9  1017.6    1023  1027.5 1036.315
## 4    wind_dir       0      30    120     220     290     330      360
## 5  wind_speed       0  4.6031 6.9047  10.357 13.8094 18.4125  26.4679
## 6        hour       0       2      6      11      17      21       23
## 7       visib     0.5       7     10      10      10      10       10
## 8        dewp    1.04   15.08  26.06   42.08   57.92   66.92    73.04
## 9        temp   19.94      32  39.92    55.4   69.98    78.8    91.04
## 10      humid   23.39   37.46  47.05   61.79   78.79   89.57      100
## 11       year    2013    2013   2013    2013    2013    2013     2013
## 12      month       1       2      4       7       9      11       12
## 13        day       1       4      8      16      23      28       31
glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O…
## $ name    <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline…
summary(airlines)
##    carrier              name          
##  Length:16          Length:16         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
anomalies(airlines)
## $variables
##   Variable  q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct      type
## 1  carrier 16   0   -     0     -      0      -    0    -        16 Character
## 2     name 16   0   -     0     -      0      -    0    -        16 Character
##   anomalous_percent
## 1                 -
## 2                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA               pNA              
##  [5] qZero             pZero             qBlank            pBlank           
##  [9] qInf              pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)
glimpse(planes)
## Rows: 3,322
## Columns: 9
## $ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
## $ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
## $ type         <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
## $ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
## $ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
## $ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
summary(planes)
##    tailnum               year          type           manufacturer      
##  Length:3322        Min.   :1956   Length:3322        Length:3322       
##  Class :character   1st Qu.:1997   Class :character   Class :character  
##  Mode  :character   Median :2001   Mode  :character   Mode  :character  
##                     Mean   :2000                                        
##                     3rd Qu.:2005                                        
##                     Max.   :2013                                        
##                     NA's   :70                                          
##     model              engines          seats           speed      
##  Length:3322        Min.   :1.000   Min.   :  2.0   Min.   : 90.0  
##  Class :character   1st Qu.:2.000   1st Qu.:140.0   1st Qu.:107.5  
##  Mode  :character   Median :2.000   Median :149.0   Median :162.0  
##                     Mean   :1.995   Mean   :154.3   Mean   :236.8  
##                     3rd Qu.:2.000   3rd Qu.:182.0   3rd Qu.:432.0  
##                     Max.   :4.000   Max.   :450.0   Max.   :432.0  
##                                                     NA's   :3299   
##     engine         
##  Length:3322       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
anomalies(planes)
## $variables
##       Variable    q  qNA    pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1        speed 3322 3299 99.31%     0     -      0      -    0    -        14
## 2         year 3322   70  2.11%     0     -      0      -    0    -        47
## 3         type 3322    0      -     0     -      0      -    0    -         3
## 4      engines 3322    0      -     0     -      0      -    0    -         4
## 5       engine 3322    0      -     0     -      0      -    0    -         6
## 6 manufacturer 3322    0      -     0     -      0      -    0    -        35
## 7        seats 3322    0      -     0     -      0      -    0    -        48
## 8        model 3322    0      -     0     -      0      -    0    -       127
## 9      tailnum 3322    0      -     0     -      0      -    0    -      3322
##        type anomalous_percent
## 1   Integer            99.31%
## 2   Integer             2.11%
## 3 Character                 -
## 4   Integer                 -
## 5 Character                 -
## 6 Character                 -
## 7   Integer                 -
## 8 Character                 -
## 9 Character                 -
## 
## $problem_variables
##   Variable    q  qNA    pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1    speed 3322 3299 99.31%     0     -      0      -    0    -        14
##      type anomalous_percent                                 problems
## 1 Integer            99.31% Anomalies present in 99.31% of the rows.
distributions(planes)
## ================================================================================

##   Variable  p_1 p_10  p_25 p_50 p_75 p_90 p_99
## 1    speed   90   97 107.5  162  432  432  432
## 2     year 1984 1990  1997 2001 2005 2009 2013
## 3  engines    2    2     2    2    2    2    2
## 4    seats 9.21   55   140  149  182  200  379
glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa   <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "…
## $ name  <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur…
## $ lat   <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4…
## $ lon   <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342…
## $ alt   <int> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10…
## $ tz    <int> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, …
## $ dst   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",…
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Ameri…
summary(airports)
##      faa                name                lat             lon         
##  Length:1458        Length:1458        Min.   :19.72   Min.   :-176.65  
##  Class :character   Class :character   1st Qu.:34.26   1st Qu.:-119.19  
##  Mode  :character   Mode  :character   Median :40.09   Median : -94.66  
##                                        Mean   :41.65   Mean   :-103.39  
##                                        3rd Qu.:45.07   3rd Qu.: -82.52  
##                                        Max.   :72.27   Max.   : 174.11  
##       alt                tz              dst               tzone          
##  Min.   : -54.00   Min.   :-10.000   Length:1458        Length:1458       
##  1st Qu.:  70.25   1st Qu.: -8.000   Class :character   Class :character  
##  Median : 473.00   Median : -6.000   Mode  :character   Mode  :character  
##  Mean   :1001.42   Mean   : -6.519                                        
##  3rd Qu.:1062.50   3rd Qu.: -5.000                                        
##  Max.   :9078.00   Max.   :  8.000
anomalies(airports)
## $variables
##   Variable    q qNA   pNA qZero pZero qBlank pBlank qInf pInf qDistinct
## 1      alt 1458   0     -    51  3.5%      0      -    0    -       911
## 2    tzone 1458   3 0.21%     0     -      0      -    0    -        10
## 3      dst 1458   0     -     0     -      0      -    0    -         3
## 4       tz 1458   0     -     0     -      0      -    0    -         7
## 5     name 1458   0     -     0     -      0      -    0    -      1440
## 6      lat 1458   0     -     0     -      0      -    0    -      1456
## 7      faa 1458   0     -     0     -      0      -    0    -      1458
## 8      lon 1458   0     -     0     -      0      -    0    -      1458
##        type anomalous_percent
## 1   Integer              3.5%
## 2 Character             0.21%
## 3 Character                 -
## 4   Integer                 -
## 5 Character                 -
## 6   Numeric                 -
## 7 Character                 -
## 8   Numeric                 -
## 
## $problem_variables
##  [1] Variable          q                 qNA               pNA              
##  [5] qZero             pZero             qBlank            pBlank           
##  [9] qInf              pInf              qDistinct         type             
## [13] anomalous_percent problems         
## <0 rows> (or 0-length row.names)
distributions(airports)
## ================================================================================

##   Variable       p_1      p_10      p_25     p_50     p_75     p_90     p_99
## 1      alt         0        15     70.25      473   1062.5     2906  6841.09
## 2       tz       -10        -9        -8       -6       -5       -5       -5
## 3      lat   21.5382   30.4803   34.2575  40.0877  45.0671  59.9414  67.6392
## 4      lon -166.3004 -154.8695 -119.1857 -94.6619 -82.5167 -76.0951 -69.9471

With the focus of the analysis being departure delay, applying the distributions() function to the flights data set shows that the distribution of departure delays is strongly right-skewed. However, since the objective of this work requires a thorough analysis of departure delay, we intend to keep the skewness. The primary focus will be on departure delays with a positive duration (flights that departed late).

One point to note is that the flights data set, which contains the departure delay times, is fairly clean, with no missing values found. The same can’t be said for the other data sets, which contain varying proportions of missing values. The handling of missing values is done on a case-by-case basis.
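As an illustration of what case-by-case handling can look like, the sketch below counts missing values per column and then treats a mostly empty column differently from a column with only a few gaps. The toy data frame and the 50% threshold are assumptions made for this example, not choices taken from the analysis.

```r
# Hypothetical toy weather rows; values are illustrative only.
weather_toy <- data.frame(
  temp      = c(39.02, 39.92, NA, 37.94),
  wind_gust = c(NA, NA, 20.7, NA),
  precip    = c(0, 0, 0, 0.1)
)

# Number and percentage of missing values in each column.
na_counts <- colSums(is.na(weather_toy))
na_share  <- 100 * na_counts / nrow(weather_toy)

# Assumed rule: drop columns that are more than 50% missing ...
mostly_missing <- names(na_share)[na_share > 50]
weather_clean <- weather_toy[, !(names(weather_toy) %in% mostly_missing), drop = FALSE]

# ... and drop the few remaining rows that still contain a missing value.
weather_clean <- weather_clean[complete.cases(weather_clean), , drop = FALSE]
```

In the real data, wind_gust (about 80% missing) and speed in the planes data (about 99% missing) would be candidates for the first rule, while temp, dewp and humid each have a single missing record.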

With the above in mind, we now move on to the in-depth analysis of the exploratory questions based on the data set.

Question 1 : Is there a pattern to the departure delay in terms of time? (Month, Day of week and Hour)

Arrival Delay vs Departure Delay

Departure delay = Actual departure time − Scheduled departure time

Arrival delay = Actual arrival time − Scheduled arrival time.

We see that a positive relationship exists between dep_delay and arr_delay: as departure delays increase, arrival delays tend to increase as well. In general, the later a plane departs, the later it will typically arrive.

In the graph below, there is a cluster of points near (0, 0). The point (0, 0) means no delay in departure or arrival; from the passenger’s point of view, the flight was on time. It seems most flights are at least close to being on time at all the origin airports [EWR,JFK,LGA].

We can also observe large positive values of dep_delay, which may be due to factors such as adverse weather conditions. In such cases, airports operate under more restrictions on take-offs and landings, so departure or arrival delays may increase.

flights%>%
  ggplot()+ aes(x = dep_delay, y = arr_delay,color=origin) + 
  geom_point(alpha = 0.2)+labs(x="Departure Delay (in minutes)", y="Arrival delay (in minutes)", title =  "Arrival Delay vs Departure Delay")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

In our study, we are looking for patterns in the delays experienced by flights departing from New York City.

Setting up global variable for flights

flights_seasonal <- flights %>% 
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  mutate(season = ifelse(month %in% 9:11, "Fall",
                 ifelse(month %in% 6:8, "Summer",
                 ifelse(month %in% 3:5, "Spring","Winter")))) %>%
  mutate(month=factor(month,levels=1:12,labels=c("Jan","Feb","Mar","Apr","May",
                                                 "Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE))%>%
  mutate(date = ymd(paste(year, month, day)), dayofweek = weekdays(date)) %>%
  mutate(day_of_week=factor(dayofweek,levels = c("Sunday","Monday","Tuesday","Wednesday",
                                                 "Thursday","Friday","Saturday"),
                            labels=c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"),
                            ordered=TRUE))

head(flights_seasonal)
##   ID year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1  1 2013   Jan   1      517            515         2      830            819
## 2  2 2013   Jan   1      533            529         4      850            830
## 3  3 2013   Jan   1      542            540         2      923            850
## 4 20 2013   Jan   1      601            600         1      844            850
## 5 26 2013   Jan   1      608            600         8      807            735
## 6 27 2013   Jan   1      611            600        11      945            931
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4        -6      B6    343  N644JB    EWR  PBI      147     1023    6      0
## 5        32      MQ   3768  N9EAMQ    EWR  ORD      139      719    6      0
## 6        14      UA    303  N532UA    JFK  SFO      366     2586    6      0
##              time_hour season       date dayofweek day_of_week
## 1 2013-01-01T10:00:00Z Winter 2013-01-01   Tuesday         Tue
## 2 2013-01-01T10:00:00Z Winter 2013-01-01   Tuesday         Tue
## 3 2013-01-01T10:00:00Z Winter 2013-01-01   Tuesday         Tue
## 4 2013-01-01T11:00:00Z Winter 2013-01-01   Tuesday         Tue
## 5 2013-01-01T11:00:00Z Winter 2013-01-01   Tuesday         Tue
## 6 2013-01-01T11:00:00Z Winter 2013-01-01   Tuesday         Tue
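flights_delayed_flight <- flights_seasonal %>% 
  # flights_seasonal already keeps only positive delays below the 99th percentile;
  # filtering on quantile(dep_delay, 0.99) again would trim a further 1% of the rows
  select(carrier, hour, origin, month, season, dep_delay, time_hour)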
head(flights_delayed_flight)
##   carrier hour origin month season dep_delay            time_hour
## 1      UA    5    EWR   Jan Winter         2 2013-01-01T10:00:00Z
## 2      UA    5    LGA   Jan Winter         4 2013-01-01T10:00:00Z
## 3      AA    5    JFK   Jan Winter         2 2013-01-01T10:00:00Z
## 4      B6    6    EWR   Jan Winter         1 2013-01-01T11:00:00Z
## 5      MQ    6    EWR   Jan Winter         8 2013-01-01T11:00:00Z
## 6      UA    6    JFK   Jan Winter        11 2013-01-01T11:00:00Z
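
The nested ifelse() calls that build the season column above can also be written with dplyr's case_when(), which keeps each month range next to its label. The following is a minimal sketch on a toy table (one row per month), not the real flights data; the object names toy and toy_seasons are hypothetical.

```r
library(dplyr)

# Toy stand-in for the flights table: one row per calendar month.
toy <- tibble(month = 1:12)

toy_seasons <- toy %>%
  mutate(season = case_when(
    month %in% 3:5  ~ "Spring",
    month %in% 6:8  ~ "Summer",
    month %in% 9:11 ~ "Fall",
    TRUE            ~ "Winter"   # remaining months: December, January, February
  ))

# Each season should cover exactly three months.
table(toy_seasons$season)
```

The conditions are checked in order, so the final TRUE branch acts as the default, in the same way as the last argument of the innermost ifelse().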

The code below computes the mean departure delay over the delayed flights (positive delays below the 99th percentile).

paste(flights%>%
       filter(dep_delay > 0,dep_delay < quantile(dep_delay, 0.99))%>%
       summarise(mean(dep_delay)))
## [1] "33.4432946620818"
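
The plots in this section draw this mean as a reference line. One possible way to avoid typing the value by hand is to compute the percentile cut-off and the mean once and reuse them. The sketch below uses a toy vector of delays with hypothetical values; the names toy_flights, p99 and overall_mean_delay are assumptions for illustration only.

```r
library(dplyr)

# Toy delays in minutes (hypothetical values, not taken from the data set).
toy_flights <- tibble(dep_delay = c(-5, -2, 0, 3, 10, 20, 45, 90, 300))

# Compute the cut-off once so that every later step uses the same threshold.
p99 <- quantile(toy_flights$dep_delay, 0.99, na.rm = TRUE)

# Keep only late departures below the cut-off.
delayed <- toy_flights %>%
  filter(dep_delay > 0, dep_delay < p99)

# Store the mean as a plain number so that it can be passed to a plot,
# e.g. geom_hline(yintercept = overall_mean_delay).
overall_mean_delay <- delayed %>%
  summarise(m = mean(dep_delay)) %>%
  pull(m)

overall_mean_delay
```

With this pattern, the reference line and the filtered data always come from the same threshold.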

1. Trend of Average Departure Delay by Hour

flights_seasonal%>%
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(hour) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot()+ aes(x = as.numeric(hour), y = avg_dep_delay,color =hour)+geom_point()+scale_x_continuous(breaks = c(5,11,17,22), labels = function(x){case_when(x == 5 ~ '5am', x == 11 ~ '11am', x == 17 ~ '5pm', x == 22 ~ '10pm')})+
  geom_smooth(position = "identity")+labs(x="Hours",y="Average Departure Delay (in minutes)",title="Average delay for each hour")+
 theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

The graph above shows that you are more likely to be delayed if you fly later in the day rather than in the morning.

Following the data, the preferred time to fly to avoid delays is between 5 a.m. and 8 a.m., when the average departure delay is approximately 20 minutes.

Minimizing departure delays early in the day also benefits subsequent flights, because it reduces the propagation of delay between consecutive flights.

The trend is clearly increasing, with the average departure delay exceeding the overall mean departure delay (33.44 minutes) between 3 p.m. and midnight.

flights_seasonal %>%
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(hour,carrier) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x = hour, y = avg_dep_delay,fill=avg_dep_delay>=33.44) +
  geom_col(alpha=0.7)+labs(x="Hours",y="Average Departure Delay (in minutes)",
       title="Average delay for each hour facet by carrier ") +
  geom_hline(aes(yintercept=33.44),linetype = 2)+
  facet_wrap(carrier~.,ncol = 4)+theme_bw()

According to our observations, most of the carriers show delay propagation: the average delay increases as we approach the end of the day.

flights_seasonal %>%
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(hour,origin) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x = hour, y = avg_dep_delay,fill=avg_dep_delay>=33.44) +
  geom_col(alpha=0.7)+labs(x="Hours",y="Average Departure Delay (in minutes)",
       title="Average delay for each hour facet by origin") +
  geom_hline(aes(yintercept=33.44),linetype = 2)+
  facet_wrap(origin~.)+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

According to our observations, the origin airports also show delay propagation: a departure delay at one stage of a flight causes a ripple effect in subsequent stages, so the average departure delay increases as the day goes on.

2. Seasonal Trend of Average Departure Delay by Hour in each season: Fall, Summer, Spring, Winter

The following graph shows the average departure delay for each month, colored by season: Fall, Summer, Spring, Winter.

flights_seasonal %>% 
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(month, season) %>%  
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x =factor(month), y = avg_dep_delay, group=season, fill=season)) + 
  geom_col() +labs(x="Months", y="Average Departure delay (in minutes)", title =  "Average Departure Delay vs Seasons ")+geom_hline(aes(yintercept=33.443),linetype = 2)+annotate("text", x = 10, y = 33.443+2, label="Avg Dep_delay(min)", size = 3 , color="black")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

New York’s climate is classed as continental, which means that it has four distinct seasons: spring (March-May), summer (June-August), autumn (September-November) and winter (December-February).

From the graph, we can conclude that the average departure delay exceeded the overall mean departure delay during the peak summer months (June and July). The high rate of tourism during the summer season might be the cause of the high average departure delay.

flights_seasonal %>% 
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(hour, season) %>%  
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x =hour, y = avg_dep_delay, group=season, color=season)) + 
  geom_line(lwd = 2) +labs(x="Hour", y="Average Departure delay (in minutes)", title =  "Seasonal Trend of Average Departure Delay by Hour ")+geom_hline(aes(yintercept=33.443),linetype = 2)+annotate("text", x = 10, y = 33.443+2, label="Avg Dep_delay(min)", size = 3 , color="black")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

The graph above shows the seasonal trend of average departure delay by hour for each season (Fall, Summer, Spring, Winter). The summer and spring curves change rapidly from hour to hour.

flights_delayed_flight %>%
  ggplot(aes(hour,color=season))+
  geom_freqpoly(binwidth = 1,lwd=2)  +
  labs(x="Hour", y="Number of Delayed flights ", title =  "Seasonal Trend of Number of Delayed Flights by Hour")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

A comparison of the total number of delayed flights per hour by season indicates that, although every season follows a similar trend, summer and spring have the highest number of delayed flights.

flights %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  mutate(date = ymd(paste(year, month, day))) %>%
  group_by(date) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x = date, y = avg_dep_delay, color=date) +
  geom_point()+geom_smooth(position = "identity")+labs(x="Dates", y="Average Departure delay (in minutes)", title =  "Average Departure Delay vs Dates ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

The total number of delayed flights per month increases during the spring and summer, i.e., from April to July, and during the winter, specifically in December.

3. Seasonal Trend of Average Departure Delay vs Number of Delayed Flights

flights %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(month,origin) %>%
  summarise(count=n())%>%
  ggplot(aes(x=month,y=count))+
  geom_line(color="#00AFBB",lwd=2)+geom_point(size=2)+
  scale_x_continuous(breaks = 1:12)+  # month is numeric here, so a continuous scale applies
  labs(x="Month",y="Number of Departure Delays",title="Number of Departure Delays vs Month")+
  facet_wrap(origin~.)+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        panel.background = element_blank()
        )

The curves for the three origin airports exhibit a similar trend. Although they are very similar, the number of delayed flights at LGA is lower than at the other airports.

flights_seasonal %>%
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(month)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%
  ggplot(aes(x=month,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Month",y="Average Departure Delay (in minutes)",title="Average delay for each month")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Among the months, September has the lowest number of delayed flights, around 8,000, and June and July have the highest. As stated earlier, June and July also experience the highest average departure delays. In December, flight delays rise again, with an increase in the number of delayed flights.

4. Average Departure Delay vs Day of week

flights_seasonal %>%
  # flights_seasonal is already filtered to positive delays below the 99th percentile
  group_by(day_of_week)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%

  ggplot(aes(x=day_of_week,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Day of the week",y="Average Departure Delay (in minutes)",title="Average delay for each day of the week ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

From the plot above, it appears that Saturday has the lowest average departure delay and the lowest number of delayed flights (approximately 14,000). Thursday and Friday have the highest number of delayed flights, approximately 20,000.

Another point to note is that Monday and Thursday have the highest average departure delays.
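
The day_of_week factor used in this comparison is derived from weekdays(), whose output depends on the locale of the machine running the code. A locale-independent alternative, sketched here in base R on three example dates, takes the numeric weekday from as.POSIXlt() and maps it to fixed labels. The variable names are hypothetical.

```r
# Three example dates: 2013-01-01 is the Tuesday shown in the flights output above.
dates <- as.Date(c("2013-01-01", "2013-01-05", "2013-01-06"))

# as.POSIXlt()$wday is numeric and does not depend on the locale:
# 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday_index <- as.POSIXlt(dates)$wday

day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Ordered factor with the same level order as the day_of_week column in this report.
day_of_week <- factor(day_labels[wday_index + 1],
                      levels = day_labels, ordered = TRUE)

day_of_week
```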

Question 2 : How does weather impact flights from NYC? What is the effect of weather on departure delay?

Setting up global variable for weather

data_fw <- flights %>% 
  inner_join(weather, by = c("origin", "time_hour","month","hour"))%>%
    mutate(count_delayed = ifelse(dep_delay > 0, 1, 0))%>%
    mutate(season = ifelse(month %in% 9:11, "Fall",
                 ifelse(month %in% 6:8, "Summer",
                 ifelse(month %in% 3:5, "Spring","Winter")))) 
    
glimpse(data_fw)
## Rows: 325,819
## Columns: 33
## $ ID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ year.x         <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day.x          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <int> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <int> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-0…
## $ year.y         <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ day.y          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ temp           <dbl> 39.02, 39.92, 39.02, 39.02, 39.92, 39.02, 37.94, 39.92,…
## $ dewp           <dbl> 28.04, 24.98, 26.96, 26.96, 24.98, 28.04, 28.04, 24.98,…
## $ humid          <dbl> 64.43, 54.81, 61.63, 61.63, 54.81, 64.43, 67.21, 54.81,…
## $ wind_dir       <int> 260, 250, 260, 260, 260, 260, 240, 260, 260, 260, 260, …
## $ wind_speed     <dbl> 12.65858, 14.96014, 14.96014, 14.96014, 16.11092, 12.65…
## $ wind_gust      <dbl> NA, 21.86482, NA, NA, 23.01560, NA, NA, 23.01560, NA, 2…
## $ precip         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pressure       <dbl> 1011.9, 1011.4, 1012.1, 1012.1, 1011.7, 1011.9, 1012.4,…
## $ visib          <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ count_delayed  <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ season         <chr> "Winter", "Winter", "Winter", "Winter", "Winter", "Wint…

We use a correlation matrix to understand which variables might be most correlated with dep_delay.

Variables are:

1.temp

2.dewp

3.humid

4.precip

5.pressure

6.visibility

According to the correlation plot, here are a few inferences:

1.High relative humidity is associated with low visibility.

2.High relative humidity is associated with precipitation: the higher the humidity, the more water vapor in the air and the more rain we are likely to see.

3.Since dewp and temp are highly correlated, we will only investigate one of them.

# Build the correlation input in its own object so that data_fw is not overwritten
cor_data <- data_fw %>%
  select(dep_delay, temp, dewp, humid, precip, pressure, visib) %>%
  na.omit() %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))

# We first plot a correlation matrix using corrplot to find the variables that
# are correlated. The matrix itself is computed with the 'cor' function.
corrplot(cor(na.omit(cor_data)), method = "square")
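
For readers who want to inspect the numbers behind such a plot, the following minimal sketch computes a correlation matrix with base R's cor() on a toy table. All values are hypothetical and only mimic the shape of the joined flights and weather data; toy, cor_matrix and cor_pairwise are illustrative names.

```r
# Toy weather-and-delay table with hypothetical values, including one missing reading.
toy <- data.frame(
  dep_delay = c(5, 12, 30, 45, 60, 8),
  humid     = c(40, 55, 70, 85, 90, NA),
  visib     = c(10, 10, 8, 4, 2, 10)
)

# Drop incomplete rows, then compute pairwise Pearson correlations.
cor_matrix <- cor(na.omit(toy))
round(cor_matrix, 2)

# Alternative: keep every row and let cor() use the complete pairs for each coefficient.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
round(cor_pairwise, 2)
```

In this toy table humidity and visibility move in opposite directions, so their coefficient is negative, which is the pattern described in the first inference above.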

data_fw <- flights %>% 
  inner_join(weather, by = c("origin", "time_hour","month","hour"))%>%
    mutate(count_delayed = ifelse(dep_delay > 0, 1, 0))%>%
    mutate(season = ifelse(month %in% 9:11, "Fall",
                 ifelse(month %in% 6:8, "Summer",
                 ifelse(month %in% 3:5, "Spring","Winter")))) 
head(data_fw)
##   ID year.x month day.x dep_time sched_dep_time dep_delay arr_time
## 1  1   2013     1     1      517            515         2      830
## 2  2   2013     1     1      533            529         4      850
## 3  3   2013     1     1      542            540         2      923
## 4  4   2013     1     1      544            545        -1     1004
## 5  5   2013     1     1      554            600        -6      812
## 6  6   2013     1     1      554            558        -4      740
##   sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance
## 1            819        11      UA   1545  N14228    EWR  IAH      227     1400
## 2            830        20      UA   1714  N24211    LGA  IAH      227     1416
## 3            850        33      AA   1141  N619AA    JFK  MIA      160     1089
## 4           1022       -18      B6    725  N804JB    JFK  BQN      183     1576
## 5            837       -25      DL    461  N668DN    LGA  ATL      116      762
## 6            728        12      UA   1696  N39463    EWR  ORD      150      719
##   hour minute            time_hour year.y day.y  temp  dewp humid wind_dir
## 1    5     15 2013-01-01T10:00:00Z   2013     1 39.02 28.04 64.43      260
## 2    5     29 2013-01-01T10:00:00Z   2013     1 39.92 24.98 54.81      250
## 3    5     40 2013-01-01T10:00:00Z   2013     1 39.02 26.96 61.63      260
## 4    5     45 2013-01-01T10:00:00Z   2013     1 39.02 26.96 61.63      260
## 5    6      0 2013-01-01T11:00:00Z   2013     1 39.92 24.98 54.81      260
## 6    5     58 2013-01-01T10:00:00Z   2013     1 39.02 28.04 64.43      260
##   wind_speed wind_gust precip pressure visib count_delayed season
## 1   12.65858        NA      0   1011.9    10             1 Winter
## 2   14.96014  21.86482      0   1011.4    10             1 Winter
## 3   14.96014        NA      0   1012.1    10             1 Winter
## 4   14.96014        NA      0   1012.1    10             0 Winter
## 5   16.11092  23.01560      0   1011.7    10             0 Winter
## 6   12.65858        NA      0   1011.9    10             0 Winter

In this section we use two measures to understand the relationship between departure delay and the weather variables.

1.Delay Percent: the number of delayed flights divided by the total number of flights, for each value of the weather variable

2.Average Departure Delay: the average departure delay of all delayed flights, for each value of the weather variable
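
The two measures defined above can be sketched on a toy table as follows. The values are hypothetical, visib stands in for any weather variable, and the names toy and measures are assumptions for illustration.

```r
library(dplyr)

# Toy table: eight flights observed at two visibility levels (hypothetical values).
toy <- tibble(
  visib     = c(10, 10, 10, 10, 2, 2, 2, 2),
  dep_delay = c(-3, 0, 5, 15, 20, 40, -1, 60)
)

measures <- toy %>%
  group_by(visib) %>%
  summarise(
    flights       = n(),                           # all flights at this level
    delayed       = sum(dep_delay > 0),            # flights that departed late
    delay_percent = 100 * delayed / flights,       # measure 1
    avg_dep_delay = mean(dep_delay[dep_delay > 0]) # measure 2, delayed flights only
  )

measures
```

Note that the delay percent uses every flight in the denominator, whereas the average departure delay is taken over delayed flights only, so the two measures can move independently.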

1. Precipitation

data_fw %>%
  filter(!is.na(precip)) %>%
  group_by (month,season,origin) %>%
  summarise(avg_precip = mean(precip, na.rm = TRUE)) %>%
  ggplot()+ aes(x = factor(month), y =avg_precip,fill=season)+geom_col(position = "identity")+theme_bw()+
  scale_x_discrete(limits=as.character(1:12))+
  labs(x="Month",y="Average precipitation (in inches)",title="Average precipitation vs Month")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())+facet_wrap(origin~.)

We see from the graph that average precipitation is high during June and July at all three origin airports. The highest average precipitation is 1.2 inches, at EWR in June.

data_fw%>%
  filter(!is.na(precip ))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(precip) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =precip,y=avg_dep_delay,color = avg_dep_delay) + geom_point()+
  geom_smooth()+labs(x="Precipitation (in inches)", y="Average Departure delay (in minutes)", title = "Precipitation vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

As the graph above shows, the average departure delay increases with precipitation once precipitation exceeds about 0.2 inches.

data_fw %>%
  filter(precip > 0, precip < quantile(precip, 0.99)) %>%
 
  group_by(precip,season)%>%
  summarise(count_delay = sum(count_delayed),
            count = n())%>%
  

  ggplot(aes(x=precip,y=(100*(count_delay/count)))) +geom_line(stat = "identity",lwd=2,color="#00AFBB") +labs(x="Precipitation (in inches) ",y="Delay Percent (%)",title="Delay Percent (%) vs Precipitation (in inches) ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank()
        )+facet_wrap(season~.)

We also see that the delay percent is higher during the summer, ranging from 60% to 80%. There is a decreasing trend in winter and an increasing trend in fall.

2. Humidity

Relative humidity is usually high at midnight and in the early morning; it drops rapidly after sunrise and is lowest just after midday, then rises again until midnight. As with precipitation, relative humidity is correlated with average delay, since higher relative humidity means a greater chance of rain.

data_fw %>%
  filter(!is.na(humid)) %>%
  group_by(hour,season) %>%
  summarise(avg_humid = mean(humid, na.rm = TRUE)) %>%
  ggplot()+ aes(x =hour, y = avg_humid, color=season)+geom_line(position = "identity",lwd=2)+theme_bw()+
  
  labs(x="Hours",y="Relative Humidity (%) ",title="Average Humidity vs Hours")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

data_fw%>%
  filter(!is.na(humid))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(humid) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =humid,y=avg_dep_delay,color = avg_dep_delay) +
  geom_smooth()+labs(x="Relative humidity (%)", y="Average Departure delay (in minutes)", title = "Relative humidity vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

As the graph above shows, the average departure delay increases as the relative humidity increases.
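
Grouping by the raw humidity reading produces one group per distinct value. A possible alternative, sketched below in base R on hypothetical values, bins the readings into fixed-width intervals with cut() before averaging. The bin width of 25 percentage points and the names toy and binned are assumptions for illustration, not choices made in this report.

```r
# Toy table of humidity readings (%) and delays in minutes (hypothetical values).
toy <- data.frame(
  humid     = c(35.2, 48.9, 51.3, 67.8, 72.4, 88.1, 93.6, 99.0),
  dep_delay = c(10, 14, 18, 22, 30, 41, 55, 62)
)

# Assign each reading to a 25-point bin: [0,25], (25,50], (50,75], (75,100].
toy$humid_bin <- cut(toy$humid, breaks = seq(0, 100, by = 25),
                     include.lowest = TRUE)

# Average delay per bin; bins without any reading are not listed.
binned <- aggregate(dep_delay ~ humid_bin, data = toy, FUN = mean)

binned
```

Binning trades resolution for stability: each average rests on more flights, which makes the trend easier to read when individual readings are sparse.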

data_fw %>%
  filter(!is.na(humid))%>%
  filter(humid > 0, humid < quantile(humid, 0.99)) %>%
 
  group_by(humid,hour)%>%
  summarise(count_delay = sum(count_delayed),
            count = n())%>%
  

  ggplot(aes(x=humid,y=(100*(count_delay/count)))) +geom_boxplot(fill="#00AFBB") +labs(x="Relative humidity ",y="Delay Percent (%)",title="Delay Percent (%) vs Relative humidity ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())+facet_wrap(hour~.)

We see a higher delay percentage at 5 a.m. and from 9 p.m. to 11 p.m., which may reflect humidity variations within a day.

3. Visibility

Visibility is estimated from the intensity of scattered light, which decreases when there are more fog droplets, smoke or haze particles, raindrops or snowflakes in the beam.

From the graph below, we see that visibility is lowest in winter, which is consistent with more fog, haze, rain and snow in that season.

data_fw %>%
  filter(!is.na(visib)) %>%
  group_by(hour,season) %>%
  summarise(avg_visib = mean(visib, na.rm = TRUE)) %>%
  ggplot()+ aes(x =hour, y = avg_visib, color=season)+geom_line(position = "identity",lwd=2)+theme_bw()+
  
  labs(x="Hours",y="Average Visibility (in miles) ",title="Average Visibility vs Hours in each seasons")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

data_fw%>%
  filter(!is.na(visib))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(visib) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =visib,y=avg_dep_delay,color = avg_dep_delay) +geom_point()+
  geom_smooth()+labs(x="Visibility (in miles)", y="Average Departure delay (in minutes)", title = "Visibility (in miles) vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Visibility is one of the main drivers of departure delay. Better visibility allows a shorter separation distance in the take-off sequence or landing queue, which helps reduce departure delays. Here, the separation distance is the distance between the current aircraft and the preceding aircraft on the same runway.

Low visibility increases take-off and landing separations, which reduces the airport’s capacity and is then likely to result in departure or arrival delays.

data_fw %>%
  filter(visib > 0, visib < quantile(visib, 0.99)) %>%
 
  group_by(visib,season)%>%
  summarise(count_delay = sum(count_delayed),
            count = n())%>%
  

  ggplot(aes(x=visib,y=(100*(count_delay/count)))) +geom_violin(trim=FALSE,color="#00AFBB") +labs(x="Visibility (in miles) ",y="Delay Percent (%)",title="Delay Percent (%) vs Visibility (in miles) ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank()
        )+facet_wrap(season~.)

According to the violin plot, the delay percent is widely spread during the winter months. Wider sections of a violin indicate that more observations fall at that delay percent, so a violin that is wide at high values implies a significant proportion of delayed flights.

4. Pressure

When the air pressure is high, the air molecules become more tightly packed and denser. Aircraft performance depends on this pressure. The propeller is more effective when it is pushing more air molecules to produce thrust. The wing generates more lift when it is pushing more air molecules downwards. Hence this may result in low average departure delay since the take-off is easier.

In the weather data set, the pressure variable has null values. The data cleaning here removes them with filter(!is.na(pressure)); the na.rm = TRUE argument of mean() serves the same purpose inside the summary.
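
The effect of missing readings on a summary can be seen in a few lines of base R. The pressure values below are a small hypothetical vector with two missing entries, and pressure_clean is an illustrative name.

```r
# Toy pressure readings in millibars, with two missing values.
pressure <- c(1011.9, NA, 1012.1, 1011.7, NA)

# Without any handling, a single NA makes the mean NA.
mean(pressure)

# Option 1: ask mean() to skip missing values.
mean(pressure, na.rm = TRUE)

# Option 2: drop missing values first, as filter(!is.na(pressure)) does on a data frame.
pressure_clean <- pressure[!is.na(pressure)]
mean(pressure_clean)
```

Both options give the same average; dropping the rows first also removes them from any counts computed later.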

data_fw %>%
  filter(!is.na(pressure)) %>%
  group_by(hour,season) %>%
  summarise(avg_pressure = mean(pressure, na.rm = TRUE)) %>%
  ggplot()+ aes(x =hour, y = avg_pressure, color=season)+geom_line(position = "identity",lwd=2)+theme_bw()+
  
  labs(x="Hours",y="Average Pressure (in millibars) ",title="Average Pressure vs Hours in each seasons")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Air pressure is lowest during the summer because temperatures are high and warm air is less dense than cold air. As air density increases (high pressure), aircraft performance improves; conversely, as air density decreases (low pressure), aircraft performance declines.

data_fw%>%
  filter(!is.na(pressure))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(pressure) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =pressure,y=avg_dep_delay,color = avg_dep_delay) +
  geom_smooth()+labs(x="Pressure (in millibars)", y="Average Departure delay (in minutes)", title = "Pressure (in millibars) vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

From the above graph, it can be seen that the average departure delay decreases as the pressure increases, starting from a value of around 980 millibars.

data_fw %>%
  
  
  filter(!is.na(pressure))%>%
  
  group_by(pressure,season)%>%
  summarise(count_delay = sum(count_delayed),
            count = n())%>%
  

  ggplot(aes(x=pressure,y=(100*(count_delay/count)))) +geom_boxplot(color="#00AFBB") +labs(x="Pressure (in millibars)",y="Delay Percent (%)",title="Delay Percent (%) vs Pressure (in millibars) ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank()
        )+facet_wrap(season~.)

The delay percent is also higher during the summer, ranging from 30% to 60%, while its distribution is lower in winter and fall.

5. Temperature

As far as temperature is concerned, we assume that both hot and cold temperatures present adverse weather conditions and affect flight delays. Hot temperatures adversely affect aircraft engine performance, whereas cold temperatures are often associated with foggy and snowy days, which may degrade airport surface operations and, as a consequence, increase flight delays as well.

data_fw %>%
  filter(!is.na(temp)) %>%
  group_by(month,season) %>%
  summarise(avg_temp = mean(temp, na.rm = TRUE)) %>%
  ggplot()+ aes(x = factor(month), y =avg_temp,fill=season)+geom_col(position = "identity")+theme_bw()+
  scale_x_discrete(limits=as.character(1:12))+
  labs(x="Month",y="Average Temperature (in F) ",title="Average Temperature (in F) vs Month")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

data_fw%>%
  filter(!is.na(temp))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(temp) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =temp,y=avg_dep_delay,color = avg_dep_delay) +
  geom_smooth()+labs(x="Temperature (in F)", y="Average Departure delay (in minutes)", title = "Temperature (in F) vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

According to the graph, low temperatures exhibit a high average delay, and the delay also increases with temperature above 40 F. Thus, a moderate temperature range is optimal for low average departure delays.

data_fw %>%
  filter(!is.na(temp))%>%
 
  group_by(temp,season)%>%
  summarise(count_delay = sum(count_delayed),
            count = n())%>%
  

  ggplot(aes(x=temp,y=(100*(count_delay/count)))) +geom_boxplot(color="#00AFBB") +labs(x="Temperature (in F) ",y="Delay Percent (%)",title="Delay Percent (%) vs Temperature (in F) ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank()
        )+facet_wrap(season~.)

As shown by the graph, both low and high temperatures are associated with more delays. As a result, the delay percent is higher in the summer and the winter.

6. Wind Speed

The wind_speed variable gives the speed of the wind at the departure airport during the hour of the flight's scheduled departure time.

High wind speeds can compromise the safety of aircraft operations, which can in turn lead to severe delays.

In the weather data set, the wind speed parameter also has null values. As with pressure, the data cleaning process removes the rows with null values using filter(!is.na(wind_speed)), since they make up only 0.02% of the values in that column.
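
Before dropping rows, it is worth confirming how much data would be lost. A minimal sketch of that check, using a made-up vector rather than the actual weather data:

```r
# Hypothetical wind speeds (mph); NA marks a missing reading
wind_speed <- c(10.4, 5.8, NA, 12.7, 8.1, 0, 16.1, NA, 9.2, 6.9)

# Share of missing values, in percent
pct_missing <- 100 * mean(is.na(wind_speed))

# Number of readings kept after the equivalent of filter(!is.na(wind_speed))
n_kept <- sum(!is.na(wind_speed))

print(pct_missing)
print(n_kept)
```

The same expression, applied to the real column, is one way to obtain the kind of missing-value percentage quoted in the text.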

data_fw%>%
  filter(!is.na(wind_speed))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(wind_speed,origin) %>%
  
  
  ggplot()+ aes(x =hour,y=wind_speed) +
  geom_smooth()+labs(x=" Hour", y="Wind speed (in mph)", title = " Wind speed (in mph) vs Hour")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())+facet_wrap(origin~.)

The variation of wind speed with time of day is called the diurnal cycle. Near the earth’s surface, winds are usually greater during the middle of the day and decrease at night. This is due to solar heating, which causes “bubbles” of warm air to rise.

From the graph we can say that JFK is the windiest airport among the three origins.

data_fw%>%
  filter(!is.na(wind_speed))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(wind_speed) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =wind_speed,y=avg_dep_delay,color = avg_dep_delay) +
  geom_smooth()+labs(x="Wind speed (in mph)", y="Average Departure delay (in minutes)", title = "Wind speed (in mph) vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Take-off and landing are the phases of a flight in which high winds are most likely to cause delays. Crosswinds (winds blowing across the runway) of about 25-35 mph are generally prohibitive of take-off and landing.
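
The crosswind that matters for a runway depends on the angle between the wind and the runway heading, not on wind speed alone. A minimal sketch of the usual trigonometric decomposition follows; the function name and the runway heading of 40 degrees are illustrative assumptions, not values from the data set.

```r
# Split a wind into headwind and crosswind components relative to a runway.
# wind_dir and runway_heading are in degrees, wind_speed in mph.
wind_components <- function(wind_speed, wind_dir, runway_heading) {
  angle <- (wind_dir - runway_heading) * pi / 180
  list(
    headwind  = wind_speed * cos(angle),       # negative values mean a tailwind
    crosswind = abs(wind_speed * sin(angle))   # magnitude across the runway
  )
}

# Hypothetical runway heading of 40 degrees, wind speed of 30 mph
aligned  <- wind_components(30, 40, 40)    # wind straight down the runway
across   <- wind_components(30, 130, 40)   # wind at 90 degrees to the runway
diagonal <- wind_components(30, 70, 40)    # wind 30 degrees off the runway

print(round(c(aligned$crosswind, across$crosswind, diagonal$crosswind), 1))
```

The same 30 mph wind produces anything from no crosswind to a full 30 mph crosswind depending on its direction, which is why wind direction is examined separately from wind speed.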

data_fw %>%
  filter(!is.na(wind_speed))%>%
  filter(wind_speed > 0, wind_speed < quantile(wind_speed, 0.99)) %>%
 
  group_by(wind_speed,season)%>%
  summarise(count_delay = sum(count_delayed),
            count = n())%>%
  

  ggplot(aes(x=wind_speed,y=(100*(count_delay/count)))) +geom_line(stat = "identity",lwd=2,color="#00AFBB") +labs(x="Wind speed (in mph) ",y="Delay Percent (%)",title="Delay Percent (%) vs Wind speed (in mph) ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank()
        )+facet_wrap(season~.)

It is known that New York experiences its windiest weather during the spring and summer months of the year.

7. Wind Direction

In the weather data set, the wind direction parameter also has null values. As with wind speed, the data cleaning process removes the rows with null values using filter(!is.na(wind_dir)), since they make up only 1.76% of the values in that column.

data_fw%>%
  filter(!is.na(wind_dir))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(wind_dir) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =wind_dir,y=avg_dep_delay,color = avg_dep_delay) +
  geom_smooth()+labs(x="Wind direction (in degrees)", y="Average Departure delay (in minutes)", title = "Wind direction (in degrees) vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

The graph of average departure delay versus wind direction resembles a sinusoidal curve, with the average departure delay attaining local maxima when the wind direction is approximately 0 or 180 degrees and local minima when it is approximately 90 or 270 degrees (directions are periodic, so 0 and 360 degrees coincide).

image: Wind Direction

Question 3: What is the effect of airport and carrier on departure delay? Which airport and carrier are the best and the worst?

1. Analysis of Airport

Preliminary inspection involves investigating three measures as follows:

Percentage delay: The proportion of delayed flights to the total flights in each airport.

Relative percentage of flights delayed: This is the proportion of flights delayed in each airport relative to the total number of flights delayed in NYC. This plot essentially supports the percentage delay plot.

Time departure percentage: The proportion of the number of flights departing “on time” to the total number of flights in each airport.
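
The three measures above can be sketched on a small invented table. The airport codes are real, but every number is made up for illustration, and flights with a zero or negative delay are treated as on time.

```r
# Toy data: one row per flight, dep_delay in minutes (values are made up)
toy <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK", "JFK", "LGA", "LGA", "LGA"),
  dep_delay = c(12, -3, 45, 0, 5, -2, -8, 20, -1, -4)
)
toy$delayed <- toy$dep_delay > 0

total_by_airport   <- tapply(toy$delayed, toy$origin, length)
delayed_by_airport <- tapply(toy$delayed, toy$origin, sum)

# Percentage delay: delayed flights / all flights at that airport
pct_delay <- 100 * delayed_by_airport / total_by_airport

# Relative percentage: delayed flights at the airport / all delayed flights
rel_pct_delay <- 100 * delayed_by_airport / sum(toy$delayed)

# Time departure percentage: on-time flights / all flights at that airport
pct_on_time <- 100 - pct_delay

print(round(rbind(pct_delay, rel_pct_delay, pct_on_time), 1))
```

Note that the relative percentages sum to 100 across airports, whereas the percentage delay and the time departure percentage sum to 100 within each airport.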

Before visualizing the percentage delay, the number of delayed flights per airport and the total number of flights per airport are explored, where only positive departure delays (late departures) are taken into account.

Total Number of Flights per Airport

flights%>%
  filter(dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  summarise(count=n())%>%
  ggplot(aes(y= count,x= reorder(origin,count),fill=count))+geom_bar(width = 0.5, stat="identity")+labs(y="Number of flights", x= "Airport", title="Airport vs Total Number of Flights Departing")+theme_bw()

Number of Delayed flights per airport

flights%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  summarise(count=n())%>%
  ggplot(aes(y= count,x= reorder(origin,count),fill=count))+geom_bar(width = 0.5,stat="identity")+labs(y="Number of flights delayed", x= "Airport", title="Airport vs Number of Flights Delayed")+theme_bw()

From the above graphs, it’s clear that EWR has the highest number of departing flights and of delayed flights, followed by JFK. LGA has the lowest for both.

1.1 Percentage Delay of Airports

When plotting the percentage delay, only positive departure delays (flights departing late) are taken into account.

flights<-flights%>%
  mutate(count_delayed= ifelse(dep_delay>0,1,0))

tot_flights_airport<-flights%>%
  filter(dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  summarise(tot_count=n())
  
delay_flights_airport<-flights%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  summarise(delay_count=n())

flight_per_airport<-tot_flights_airport%>%
  inner_join(delay_flights_airport,by="origin")
 
flight_per_airport%>%
  group_by(origin)%>%
  summarise(delay_per= 100*(delay_count/tot_count))%>%
ggplot()+aes(y=delay_per, x=reorder(origin,delay_per),fill=origin)+geom_bar(width = 0.5,stat="identity")+labs(x="Airport", y=" Proportion of Delayed Flights out of Total Flights (%)", title="Percentage Delay of Flights Per Origin")+theme_bw()

From the above graph, it’s clear that EWR has the highest proportion of delayed flights of its total flights, being the worst performing airport in that regard. On the other hand, LGA is the best performing airport.

1.2 Relative Percentage of Flights Delayed

flights%>%
  filter(dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  summarise(rel_per= 100*n()/nrow(flights))%>%
  ggplot()+aes(y=rel_per, x= reorder(origin,rel_per),fill=rel_per)+geom_bar(width = 0.5,stat="identity")+labs(y="Relative Percentage of Flights (%)", x="Airport",title="Percentage of Flights Delayed Relative to Total Number of Flights" )+theme_bw()

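From the above plot, it can be seen that EWR has the highest percentage (35.39%) and LGA has the lowest percentage (30.57%). Note that the code above does not filter on dep_delay > 0, so these figures are each airport's share of all NYC flights rather than of delayed flights only.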

1.3 Time Departure Percentage per Airport

When plotting the time departure percentage, the departure delay is split into two categories, ‘On Time’ and ‘Delayed’. ‘On Time’ covers flights with a negative (early departure) or zero departure delay. ‘Delayed’ covers the remaining flights (late departures).
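
One detail of this split is worth checking: `ifelse()` returns `NA` when `dep_delay` is missing, so a flight without a recorded departure (for example a cancellation, if any remain in the data) falls into neither category. A small sketch with made-up values:

```r
# Made-up departure delays in minutes; NA stands for a flight with no recorded departure
dep_delay <- c(-5, 0, 12, NA, 3)

# Same rule as in the report: early or exactly on schedule counts as "on time"
dep_category <- ifelse(dep_delay <= 0, "on time", "delayed")

# useNA = "ifany" makes any unclassified flights visible in the tally
category_counts <- table(dep_category, useNA = "ifany")
print(category_counts)

# On-time percentage among the flights that could be classified
pct_on_time <- 100 * mean(dep_category == "on time", na.rm = TRUE)
print(pct_on_time)
```

If missing delays were present, a percentage computed without `na.rm = TRUE` would itself be `NA`, so the tally is a cheap sanity check before plotting.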

flights_trd<-flights%>%
  mutate(dep_category= ifelse(dep_delay <= 0,"on time", "delayed"))

flights_trd%>%
  filter(dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  ggplot()+aes(x=origin,fill=dep_category)+geom_bar(width = 0.5)+labs(x="Airports",y="Number of Flights (Delayed/On Time)", title="Count of Flights 'Delayed' and 'On Time' Per Airport")+theme_bw()

flights_trd%>%
  filter(dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(origin)%>%
  summarise(per_dep=100*sum(dep_category=="on time")/n())%>%
  arrange(desc(per_dep))%>%
  ggplot()+aes(x=origin,y=per_dep, fill=origin)+geom_bar(width = 0.5,stat="identity")+labs(x="Airport",y="Percentage of time departure(%)",title="Time Departure Percentage Per Airport")+theme_bw()

From the plot of the time departure percentage, it’s clear that LGA has the highest proportion of its flights departing on time (67.58%) and is hence the best performing airport in terms of time departure percentage. EWR is the worst performing airport (time departure percentage: 55.85%).

In conclusion, purely on the basis of the above plots, EWR is the worst performing airport and LGA is the best performing. In terms of total flights, LGA has the lowest number of departing flights and EWR the highest. This may be due to the fact that EWR and JFK are primarily international airports, with larger numbers of passengers, possibly longer boarding times, more security checks and more flights than LGA, which is primarily a domestic airport.

1.4 Effect of Airport on Average Departure Delay

flights%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(origin) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =origin,y=avg_dep_delay,fill = avg_dep_delay) +geom_col(width = 0.5)+labs(x="Origin/Airport", y="Average Departure delay (in minutes)", title = "Origin vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Here, LGA has the highest average departure delay, even though it had the lowest proportion of delayed flights. JFK has the lowest average departure delay. This could be due to the fact that LGA is primarily a domestic airport, with far fewer resources than the international airport JFK.
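
The two measures need not agree, because the average here is taken only over flights that were already late. A toy illustration with invented numbers (not the NYC data): hypothetical airport A delays few flights but by a long time, while airport B delays many flights by a short time.

```r
# Invented delays in minutes for two hypothetical airports
delays <- data.frame(
  airport   = rep(c("A", "B"), each = 6),
  dep_delay = c(-4, -2, 0, -1, 60, 90,    # A: 2 of 6 flights late, both badly
                5, 10, 8, 12, -3, 15)     # B: 5 of 6 flights late, all mildly
)

late <- delays[delays$dep_delay > 0, ]

# Share of flights that left late, per airport
pct_late <- 100 * tapply(delays$dep_delay > 0, delays$airport, mean)

# Average delay among late flights only, per airport
avg_late_delay <- tapply(late$dep_delay, late$airport, mean)

print(round(pct_late, 1))
print(avg_late_delay)
```

Airport A has the lower share of late flights but the higher average delay, the same pattern reported for LGA.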

To investigate why LGA has the highest average departure delay, its departure delay will be analysed over the span of the year.

flights %>% 
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  mutate(date = ymd(paste(year, month, day))) %>%
  group_by(date,origin)%>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot()+aes(x =date, y = avg_dep_delay, group=origin, color=origin) + 
geom_smooth()+labs(x="Duration", y="Average Departure delay (in minutes)", title =  "Trend of Average Departure Delay by Duration for Origin")+
  theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

From the above plot, it’s clear that between the summer and winter seasons, LGA has a higher average departure delay than the other two airports. This may be due to a lack of sufficient workforce to deal with the rush experienced during these peak seasons, when travel is more likely to occur.

2. Analysis of Carrier

Again, preliminary analysis involves investigating the percentage delay for the carrier and relative percentage of flights departing per carrier.

In this case, the relative percentage of flights departing is the proportion of a carrier's flights relative to the total number of flights departing in NYC. The percentage delay is the proportion of delayed flights to the total flights for each carrier.

Before that, the number of flights per carrier is explored.

flights%>%
  filter(dep_delay < quantile(dep_delay, 0.99))%>%
  group_by(carrier)%>%
  summarise(count=n())%>%
  ggplot(aes(y= count,x= reorder(carrier,count),fill=count))+geom_bar( stat="identity")+labs(y="Total Number of Flights ", x= "Carriers", title="Total Number of Flights per Carrier")+theme_bw()

flights%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(carrier)%>%
  summarise(count=n())%>%
  ggplot(aes(y= count,x= reorder(carrier,count),fill=count))+geom_bar( stat="identity")+labs(y="Number of Flights Delayed", x= "Carriers", title="Carriers vs Number of Flights Departed Late")+theme_bw()

From the above plot, it’s clear that UA is the carrier with the highest number of delayed flights, followed by EV, B6 and DL.

The order is similar when it comes to the total number of flights, with UA having the highest total number of flights, followed by B6, EV and then DL.

Exploring the percentage delay and relative percentage

Similar to the method followed in the analysis of airports, when plotting the percentage delay, only positive departure delays (flights departing late) are taken into account.

2.1 Percentage Delay of Carriers

flights<-flights%>%
  mutate(count_delayed= ifelse(dep_delay>0,1,0))

flights%>%
  filter( dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(carrier)%>%
  summarise(prop_delay=100 * mean(count_delayed))%>%
  ggplot()+aes(y=prop_delay, x= reorder(carrier,prop_delay),fill=prop_delay)+geom_bar(stat="identity")+labs(x="Carriers", y=" Proportion of Delayed Flights out of Total Flights (%)", title="Percentage Delay of Flights Per Carrier")+theme_bw()

From the above graph, it’s clear that carrier WN has the highest proportion of delayed flights of its total flights, being the worst performing carrier in that regard. WN is followed by FL, F9, UA and EV. On the other hand, HA is the best performing carrier.

flights%>%
  filter(dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(carrier,origin)%>%
  summarise(prop_delay=100 * mean(count_delayed))%>%
  ggplot()+aes(x=prop_delay, y= reorder(carrier,prop_delay),fill=prop_delay)+geom_bar(stat="identity")+labs(y="Carriers", x=" Proportion of Delayed Flights out of Total Flights (%)", title="Percentage Delay of Flights Per Carrier")+theme_bw()+facet_wrap(~origin)

From the above, a few things can be inferred. First, WN, the worst performing carrier in this regard, unsurprisingly has the highest percentage delay at two of the three airports (EWR and LGA). Second, some carriers have no flights at certain airports. Carriers FL, F9 and YV, for example, seem to have no flights from the international airports EWR and JFK, which may be because those carriers focus primarily on domestic services. Even though they operate only out of LGA, they have a relatively high percentage delay, most probably because they are smaller operations with limited resources.

2.2 Relative Percentage of Flights Delayed

flights%>%
  filter(dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(carrier)%>%
  summarise(rel_per= 100*n()/nrow(flights))%>%
  ggplot()+aes(y=rel_per, x= reorder(carrier,rel_per),fill=rel_per)+geom_bar(stat="identity")+labs(y="Percentage of Flights (%)", x="Carriers",title="Percentage of Flights Delayed Relative to Total Number of Flights" )

Unsurprisingly, UA, the carrier with the highest number of total flights and total delayed flights, contributes most to the relative percentage of flights delayed. However, considering the smaller size of WN and its having the highest percentage delay per carrier, WN is a strong candidate for the worst performing carrier. EV is another possible candidate, along with B6.

Finding the best performing carrier, like finding the worst one, is not straightforward and requires additional analysis, namely looking into the average departure delay per carrier.

2.3 Effect of Carrier on Average Departure Delay

tot_avg_dep_delay<-flights%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
  summarise(avg_dep_delay=mean(dep_delay,na.rm=TRUE))

flights%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99))%>%
  
  group_by(carrier)%>%
  summarise(avg_dep_delay=mean(dep_delay,na.rm=TRUE))%>%
  ggplot()+geom_bar(aes(x= avg_dep_delay, y=reorder(carrier,avg_dep_delay)),fill="lightblue", width=0.5, stat="identity")+labs(x="Average Departure Delay (in mins)",y="Carriers", title= "Average Departure Delay vs Carrier")+theme_bw()+geom_vline(aes(xintercept=mean(tot_avg_dep_delay$avg_dep_delay), linetype= "Average Departure Delay in New York"),color='red')

In the above plot, although OO and YV seem to have the outright highest average departure delay, that result is deceptive, mainly because OO and YV have very few departing flights. For a less biased conclusion, the analysis on the basis of departure delay is therefore restricted to carriers with a substantial number of departing flights. Considering EV’s high number of departing flights and the fact that it has the third highest average departure delay, EV can be considered the worst carrier of the bunch.
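
One way to make this restriction explicit is to compute the number of flights alongside the average and apply a minimum-count cut-off. The sketch below uses invented carriers and delays, and the threshold of 30 flights is purely a hypothetical choice.

```r
# Invented per-flight delays (minutes) for three hypothetical carriers
flights_toy <- data.frame(
  carrier   = c(rep("AAA", 50), rep("BBB", 40), rep("ZZZ", 2)),
  dep_delay = c(rep(c(20, 30), 25), rep(c(25, 45), 20), 120, 80)
)

# Average delay and number of flights per carrier
by_carrier    <- split(flights_toy$dep_delay, flights_toy$carrier)
avg_dep_delay <- sapply(by_carrier, mean)
n_flights     <- sapply(by_carrier, length)

# Hypothetical cut-off: ignore carriers with fewer than 30 flights
min_flights <- 30
reliable <- avg_dep_delay[n_flights >= min_flights]

worst_overall  <- names(which.max(avg_dep_delay))
worst_reliable <- names(which.max(reliable))

print(rbind(avg_dep_delay, n_flights))
print(c(worst_overall = worst_overall, worst_reliable = worst_reliable))
```

With only two flights, the hypothetical carrier ZZZ tops the raw ranking, but it drops out once the cut-off is applied.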

Since UA has the highest number of departing flights and the lowest average departure delay among the carriers with similarly high numbers of departing flights, UA shows a lot of promise as potentially the best carrier.

To test that notion, a calendar-style plot of departure delay per hour per month is created for a select few relevant carriers, to check for jarring variations in any period (season or month) in which UA performs poorly.

2.4 Calendar Diagram for Select Carriers

flights %>% group_by(carrier,month) %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  filter(carrier %in% c('UA','B6','EV','DL','AA','MQ','US','9E'))%>%

  
  
  ggplot() +geom_tile(aes(x = hour, y = carrier, fill = dep_delay), color = 'black')+scale_fill_distiller(palette ='Spectral')+facet_wrap(~month)

Having investigated the performance of UA in terms of departure delay over the twelve months, no jarring seasonal issues were observed, apart from occasional spikes. Overall, UA remained relatively consistent in terms of departure delay over the twelve-month period, supporting the earlier notion that UA is the best performing carrier.
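
Because several flights share the same carrier, month and hour, each tile in a plot like the one above is drawn once per flight. One possible refinement, sketched here on invented data, is to average the delay per tile first. The column name `avg_dep_delay` is an assumption for illustration.

```r
# Invented delayed flights: several rows share the same carrier/month/hour cell
toy <- data.frame(
  carrier   = c("UA", "UA", "UA", "EV", "EV", "UA"),
  month     = c(1, 1, 1, 1, 1, 2),
  hour      = c(8, 8, 9, 8, 8, 8),
  dep_delay = c(10, 30, 5, 40, 60, 15)
)

# One row per (carrier, month, hour) cell, holding the mean delay for that cell
tiles <- aggregate(dep_delay ~ carrier + month + hour, data = toy, FUN = mean)
names(tiles)[names(tiles) == "dep_delay"] <- "avg_dep_delay"

print(tiles)
```

The resulting table has exactly one value per tile and could be passed to `geom_tile()` with `fill = avg_dep_delay`.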

Question 4 - What is the impact of plane manufacturer and structure of the aircraft on departure delays?

Setting up global variables for planes and airports
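
An inner join keeps only the rows whose key appears in both tables, and columns that share a name are given suffixes. The sketch below shows both effects with base R's `merge()` on invented miniature tables; the tail numbers and years are made up.

```r
# Invented miniature versions of the two tables
flights_toy <- data.frame(
  tailnum   = c("N1", "N2", "N3", "N9"),
  year      = c(2013, 2013, 2013, 2013),   # year of the flight
  dep_delay = c(5, -2, 30, 12)
)
planes_toy <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year    = c(1999, 2012, NA)              # year of manufacture, may be missing
)

# Base R inner join: keeps only matching tail numbers and renames the
# clashing 'year' columns to year.x (flights) and year.y (planes)
joined <- merge(flights_toy, planes_toy, by = "tailnum")
joined$age_of_plane <- 2013 - joined$year.y

rows_dropped <- nrow(flights_toy) - nrow(joined)
print(joined)
print(rows_dropped)
```

The flight with tail number N9 disappears because it has no match in the planes table, and a missing manufacture year carries through to a missing plane age.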

data_fp <- flights %>%
  inner_join(planes, by = "tailnum") %>%
  mutate(count_delayed = ifelse(dep_delay > 0, 1, 0)) %>%
  mutate(season = ifelse(month %in% 9:11, "Fall",
                  ifelse(month %in% 6:8, "Summer",
                  ifelse(month %in% 3:5, "Spring", "Winter")))) %>%
  mutate(age_of_plane = 2013 - year.y) %>%
  mutate(date = ymd(paste(year.x, month, day))) %>%
  mutate(date1 = ymd(paste(year.y, month, day)))

head(data_fp)
##   ID year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1  1   2013     1   1      517            515         2      830            819
## 2  2   2013     1   1      533            529         4      850            830
## 3  3   2013     1   1      542            540         2      923            850
## 4  4   2013     1   1      544            545        -1     1004           1022
## 5  5   2013     1   1      554            600        -6      812            837
## 6  6   2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##              time_hour count_delayed year.y                    type
## 1 2013-01-01T10:00:00Z             1   1999 Fixed wing multi engine
## 2 2013-01-01T10:00:00Z             1   1998 Fixed wing multi engine
## 3 2013-01-01T10:00:00Z             1   1990 Fixed wing multi engine
## 4 2013-01-01T10:00:00Z             0   2012 Fixed wing multi engine
## 5 2013-01-01T11:00:00Z             0   1991 Fixed wing multi engine
## 6 2013-01-01T10:00:00Z             0   2012 Fixed wing multi engine
##   manufacturer     model engines seats speed    engine season age_of_plane
## 1       BOEING   737-824       2   149    NA Turbo-fan Winter           14
## 2       BOEING   737-824       2   149    NA Turbo-fan Winter           15
## 3       BOEING   757-223       2   178    NA Turbo-fan Winter           23
## 4       AIRBUS  A320-232       2   200    NA Turbo-fan Winter            1
## 5       BOEING   757-232       2   178    NA Turbo-fan Winter           22
## 6       BOEING 737-924ER       2   191    NA Turbo-fan Winter            1
##         date      date1
## 1 2013-01-01 1999-01-01
## 2 2013-01-01 1998-01-01
## 3 2013-01-01 1990-01-01
## 4 2013-01-01 2012-01-01
## 5 2013-01-01 1991-01-01
## 6 2013-01-01 2012-01-01
data_fa <- airports %>%
  inner_join(flights, c("faa" = "dest"))%>%
    mutate(count_delayed = ifelse(dep_delay > 0, 1, 0))

head(data_fa)
##   faa                              name      lat       lon  alt tz dst
## 1 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 2 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 3 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 4 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 5 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
## 6 ABQ Albuquerque International Sunport 35.04022 -106.6092 5355 -7   A
##            tzone    ID year month day dep_time sched_dep_time dep_delay
## 1 America/Denver 27276 2013    10   1     1955           2001        -6
## 2 America/Denver 28256 2013    10   2     2010           2001         9
## 3 America/Denver 29216 2013    10   3     1955           2001        -6
## 4 America/Denver 30230 2013    10   4     2017           2001        16
## 5 America/Denver 30954 2013    10   5     1959           1959         0
## 6 America/Denver 31811 2013    10   6     1959           2001        -2
##   arr_time sched_arr_time arr_delay carrier flight tailnum origin air_time
## 1     2213           2248       -35      B6     65  N554JB    JFK      230
## 2     2230           2248       -18      B6     65  N607JB    JFK      238
## 3     2232           2248       -16      B6     65  N591JB    JFK      251
## 4     2304           2248        16      B6     65  N662JB    JFK      257
## 5     2226           2246       -20      B6     65  N580JB    JFK      242
## 6     2234           2248       -14      B6     65  N507JB    JFK      240
##   distance hour minute            time_hour count_delayed
## 1     1826   20      1 2013-10-02T00:00:00Z             0
## 2     1826   20      1 2013-10-03T00:00:00Z             1
## 3     1826   20      1 2013-10-04T00:00:00Z             0
## 4     1826   20      1 2013-10-05T00:00:00Z             1
## 5     1826   19     59 2013-10-05T23:00:00Z             0
## 6     1826   20      1 2013-10-07T00:00:00Z             0

1. Analysis of Manufacturer

1.1 Analysis of Manufacturer with respect to number of delayed flights

To understand the performance of each manufacturer, let us first explore the number of delayed flights per manufacturer. Below we see that ‘BOEING’, ‘EMBRAER’, ‘AIRBUS’, ‘AIRBUS INDUSTRIE’, ‘BOMBARDIER INC’, ‘MCDONNELL DOUGLAS AIRCRAFT CO’ and ‘CANADAIR’ are at the top of the list. Hence, we carry this list forward for further analysis.

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(manufacturer) %>%
  summarise(count=n())%>% 
  ggplot(aes(y=reorder(manufacturer,count),x=count))+
  geom_bar(stat = 'identity',fill="steelblue")+
  
  labs(y="Manufacturer",x="Number of departure delays",title="Number of departure delays in each Manufacturer")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        panel.background = element_blank()
        )

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','MCDONNELL DOUGLAS AIRCRAFT CO','CANADAIR'))%>%
  group_by(manufacturer)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%

  ggplot(aes(y=reorder(manufacturer,avg_dep_delay),x=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_vline(aes(xintercept=33.44),linetype = 2)+labs(y="Manufacturer",x="Average Departure Delay (in minutes)",title="Average departure delay for each Manufacturer ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

From the graph we see that although Boeing has the highest number of delayed flights, its average departure delay is below the mean departure delay (the dashed line at 33.44 minutes). Canadair has the highest average departure delay.

Investigating the trend for the three manufacturers with the highest average departure delay (‘EMBRAER’, ‘BOMBARDIER INC’ and ‘CANADAIR’), we infer that the average departure delay for Canadair increases from July. Bombardier and Embraer show a decreasing trend in average departure delay from July, followed by an increase in December, i.e. the holiday season.

data_fp %>% 
  
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  filter(manufacturer %in% c('EMBRAER','BOMBARDIER INC','CANADAIR'))%>%
  group_by(date,manufacturer)%>%
  
  
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot()+aes(x =date, y = avg_dep_delay, group=manufacturer, color=manufacturer) + 
geom_smooth()+labs(x="year", y="Average Departure delay (in minutes)", title =  " Trend of Average Departure Delay for 'EMBRAER','BOMBARDIER INC','CANADAIR' ")+
  theme(plot.title = element_text(hjust = 0.15),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

1.2 Analysis of Manufacturer with respect to average departure delay

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%group_by(manufacturer)%>%
  summarise(avg_dep_delay = 
              mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x=reorder(manufacturer,avg_dep_delay), y=avg_dep_delay,fill=avg_dep_delay)+
  geom_bar(stat="identity") +
  labs(
    title = "Average Departure Delays for Different Manufacturers",
    x = "Manufacturer",
    y = "Average Delay (mins)",
  ) +  ylim(0,90) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Aircraft manufactured by AVIAT AIRCRAFT INC show a considerably higher average delay than those of other manufacturers. Since no clear pattern emerges from these graphs, we plot the same graph for each carrier to look for carrier-specific patterns.

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%group_by(manufacturer,carrier)%>%
  summarise(avg_dep_delay = 
              mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x=reorder(manufacturer,avg_dep_delay), y=avg_dep_delay,fill=avg_dep_delay)+
  geom_bar(stat="identity") +
  labs(
    title = "Average Departure Delays for Different Manufacturers",
    x = "Manufacturer",
    y = "Average Delay (mins)",
  ) +  ylim(0,90) +facet_wrap(carrier~.)+
  theme_bw()

From the above graphs, we can observe that American Airlines Inc. (AA) has the highest delay times among all carriers. This suggests that the delay may be driven by the operations of the airline rather than by the manufacturer.
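One rough way to check whether delay varies more by carrier than by manufacturer is to compare how far apart the group means are under each grouping. The sketch below uses an invented data frame `toy_fp` and a hypothetical helper `spread_of_means()`; it is only an illustration of the idea, not a formal test.

```r
library(dplyr)

# Toy data, invented for illustration: delays differ strongly by carrier
# and only slightly by manufacturer.
toy_fp <- tibble(
  carrier      = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  manufacturer = c("BOEING", "BOEING", "AIRBUS", "AIRBUS",
                   "BOEING", "BOEING", "AIRBUS", "AIRBUS"),
  dep_delay    = c(50, 54, 52, 56, 20, 24, 22, 26)
)

# Hypothetical helper: range (max minus min) of the group means of dep_delay
# when the data is grouped by the column named in group_col.
spread_of_means <- function(data, group_col) {
  means <- data %>%
    group_by(.data[[group_col]]) %>%
    summarise(avg_dep_delay = mean(dep_delay), .groups = "drop")
  max(means$avg_dep_delay) - min(means$avg_dep_delay)
}

carrier_spread      <- spread_of_means(toy_fp, "carrier")
manufacturer_spread <- spread_of_means(toy_fp, "manufacturer")

print(c(carrier = carrier_spread, manufacturer = manufacturer_spread))
```

A much larger spread across carriers than across manufacturers would be consistent with the carrier-driven explanation, although carriers and manufacturers are not independent in real fleets, so the comparison is indicative only.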

2. Analysis of the structure of the aircraft with respect to capacity (seats), number of engines, and engine type

2.1 Aircraft Capacity (Seats)

data_fp%>%
  filter(!is.na(seats))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(seats) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =seats,y=avg_dep_delay,color = avg_dep_delay) +geom_point()+
  geom_smooth()+labs(x="Seats", y="Average Departure delay (in minutes)", title = "Seats vs Average Departure Delay")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

The average departure delay increases as the number of seats increases, particularly for very large aircraft. This may be due to the longer time needed for aircraft cleaning, baggage loading, and fueling. Large aircraft also mean longer passenger queues, which can further increase the departure delay.

One other point that can be deduced from this graph is that the average departure delay is lowest at approximately 250 seats.

data_fp%>%
  filter(!is.na(seats))%>%
  group_by(seats_group=cut(seats,breaks= seq(0,450, by =75)),origin) %>%
  summarise(Total_count=n())%>%
  ggplot(aes(x = Total_count, y = reorder(factor(seats_group),
          Total_count),fill=seats_group)) +
  geom_bar(width=0.7, stat = "identity") +
  theme_bw(base_line_size = 0, base_size = 9) +labs(x="Number of Flights",
  y="Seat Number Group",title="Number of flights vs Seat per Origin")+facet_wrap(origin~.)

With respect to the origin, we observe that JFK has the highest number of large aircraft, followed by EWR. Our analysis also found that the proportion of delayed flights is higher at EWR and JFK. This could be because LGA is primarily a domestic airport and handles fewer large aircraft than the international airports EWR and JFK.
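The proportion of delayed flights per origin mentioned above can be computed as the share of flights with a positive departure delay. The following is a minimal sketch on an invented data frame `toy_flights`; the column names `origin` and `dep_delay` follow the report's code, while the values are made up.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy_flights <- tibble(
  origin    = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  dep_delay = c(15, 30, -2, 5, 12, -4, -3, 0, 20, -1)
)

# Share of flights with a positive departure delay, per origin airport.
delay_share <- toy_flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(origin) %>%
  summarise(
    n_flights   = n(),
    n_delayed   = sum(dep_delay > 0),
    pct_delayed = 100 * mean(dep_delay > 0)
  ) %>%
  arrange(desc(pct_delayed))

print(delay_share)
```

Note that, unlike the plots in this section, the percentage must be computed before filtering to `dep_delay > 0`, otherwise every remaining flight counts as delayed.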

Looking at the manufacturers, we see that Boeing, Airbus, and Airbus Industrie mostly produce large aircraft with 150 to 225 seats. Embraer, in contrast, falls into a single seat group of up to 75 seats.
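The seat groups referred to here come from `cut()` with breaks every 75 seats. As a small illustration of how seat counts map to those groups, the sketch below applies the same breaks to a few invented seat values.

```r
# Invented seat counts, chosen to sit on and around the break points.
seats <- c(20, 75, 76, 150, 200, 225, 400)

# Same breaks as in the report: intervals are open on the left and
# closed on the right, so 75 falls in (0,75] and 76 in (75,150].
seat_group <- cut(seats, breaks = seq(0, 450, by = 75))

seat_table <- data.frame(seats = seats, seat_group = as.character(seat_group))
print(seat_table)
```

Because the intervals are open on the left, a seat count of exactly 0 would not fall into any group and would become `NA`.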

data_fp%>%
  filter(!is.na(seats))%>%
  filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','MCDONNELL DOUGLAS AIRCRAFT CO','CANADAIR'))%>%
  group_by(seats_group=cut(seats,breaks= seq(0,450, by =75)),manufacturer) %>%
  summarise(Total_count=n())%>%
  ggplot(aes(x = Total_count, y = reorder(factor(seats_group),
          Total_count),fill=seats_group)) +
  geom_bar(width=0.7, stat = "identity") +
  theme_bw(base_line_size = 0, base_size = 9) +labs(x="Number of Flights",
  y="Seat Number Group",title="Number of flights vs Seat per  Manufacturer ")+facet_wrap(manufacturer~.)

2.2 Engine Type and Number of Engines

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  
  group_by(engine)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%

  ggplot(aes(x=engine,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Engine Type",y="Average Departure Delay (in minutes)",title="Average departure delay for each engine type ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

4-cycle engines show the highest average departure delay. However, turbo-fan engines account for the highest number of delayed flights.

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  
  group_by(engine,carrier)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%

  ggplot(aes(x=engine,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Engine Type",y="Average Departure Delay (in minutes)",title="Average departure delay for each engine type ")+theme_bw()+facet_wrap(carrier~.)

American Airlines Inc. (AA) potentially has the highest number of flights on aircraft with 4-cycle engines, which may explain why its average delay time is higher than that of other carriers.
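The statement about AA and 4-cycle engines is a hypothesis, and it could be checked by counting flights per carrier and engine type. Below is a minimal sketch on an invented data frame `toy_fp`; the engine labels are assumed to match those in the planes data.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy_fp <- tibble(
  carrier = c("AA", "AA", "AA", "UA", "UA", "B6"),
  engine  = c("Turbo-fan", "Turbo-fan", "4 Cycle", "Turbo-fan", "Turbo-jet", "Turbo-fan")
)

# Number of flights per carrier and engine type, plus each engine type's
# share of that carrier's flights.
engine_mix <- toy_fp %>%
  count(carrier, engine, name = "n_flights") %>%
  group_by(carrier) %>%
  mutate(share = n_flights / sum(n_flights)) %>%
  ungroup()

# Carriers that operate any flights with 4-cycle engines.
four_cycle <- engine_mix %>% filter(engine == "4 Cycle")

print(engine_mix)
print(four_cycle)
```

If the share of 4-cycle flights for a carrier turns out to be very small, it is unlikely to explain that carrier's overall average delay.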

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(engines)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%

  ggplot(aes(x=engines,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity") +geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Number of Engines ",y="Average Departure Delay (in minutes)",title="Average departure delay vs Number of engines ")+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

Aircraft with three engines show the lowest average departure delay. Because early departures are excluded from this analysis, this indicates shorter delays rather than early departures. For the other aircraft, the delay time increases as the number of engines increases from 1 to 2 to 4.

data_fp %>% 
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(engine,engines) %>%  
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x =engines, y = avg_dep_delay, group=engine, color=engine)) + scale_x_continuous(breaks = c(1,2,3,4))+
  geom_line(lwd = 2) +labs(x="Number of Engines", y="Average Departure delay (in minutes)", title = "Trend of Average Departure Delay vs Number of engines for each engine type")+geom_hline(aes(yintercept=33.443),linetype = 2)+
  theme_bw()

The graph above shows that the departure delay decreases as the number of engines increases for reciprocating and turbo-fan engines. However, the same cannot be said for 4-cycle and turbo-jet engines.

2.3 Age of the aircraft

data_fp %>%
  filter(!is.na(age_of_plane ))%>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(age_of_plane) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  
  ggplot()+ aes(x =age_of_plane,y=avg_dep_delay,color = avg_dep_delay) + geom_point()+
  geom_smooth(method="lm")+labs(x="Age of the plane in years", y="Average Departure delay (in minutes)", title = "Average Departure Delay vs Age of the plane")+theme_bw()+theme(plot.title = element_text(hjust = 0.6),
        legend.title = element_text(size = 8),
        legend.text = element_text(size = 6),
        strip.text.x=element_text(size=8),
        strip.background = element_blank(),
        panel.background = element_blank())

As the age of the aircraft does not correlate well with the average departure delay, we analyze it further by splitting it across the manufacturers that have the greatest number of delayed flights.
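The weak relationship between age and delay can be quantified with a correlation coefficient and the R-squared of the fitted line. The following sketch uses an invented data frame `toy_age` whose columns mirror the aggregated data in the plot above; the values are made up and do not reproduce the report's result.

```r
# Toy aggregated data, invented for illustration only.
toy_age <- data.frame(
  age_of_plane  = c(1, 5, 10, 15, 20, 25),
  avg_dep_delay = c(30, 36, 31, 35, 32, 34)
)

# Pearson correlation between plane age and average departure delay.
r <- cor(toy_age$age_of_plane, toy_age$avg_dep_delay)

# The same linear fit that geom_smooth(method = "lm") draws.
fit <- lm(avg_dep_delay ~ age_of_plane, data = toy_age)
r_squared <- summary(fit)$r.squared

print(round(c(correlation = r, r_squared = r_squared), 3))
```

For a simple linear regression with one predictor, R-squared equals the squared correlation, so a correlation close to zero means the fitted line explains very little of the variation in delay.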

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','CANADAIR'))%>%
  group_by(age_of_plane,manufacturer)%>%
  summarise(Number_of_delayed_flight_in_1000=n()/1000,avg_dep_delay = mean(dep_delay, na.rm = TRUE))%>%

  # Set breaks and limits in one scale; a separate xlim() would replace the scale and drop the breaks
  ggplot(aes(x=age_of_plane,y=avg_dep_delay,fill=Number_of_delayed_flight_in_1000)) +geom_bar(stat = "identity")+scale_x_continuous(breaks = c(5,10,15,20,25,30), limits = c(0, 30))+geom_hline(aes(yintercept=33.44),linetype = 2)+labs(x="Age of the plane",y="Average Departure Delay (in minutes)",title="Average Departure Delay vs Age of the plane per manufacturer")+theme_bw()+facet_wrap(manufacturer~.)

For Boeing aircraft of all ages, the average departure delay is lower than the overall average departure delay. This may indicate that Boeing aircraft are well maintained and that service and parts are readily available. Embraer, in contrast, shows a high departure delay across all aircraft ages.

data_fp %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(year.y)%>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot()+
  aes(x=year.y, y=avg_dep_delay)+
  geom_line(lwd=2,colour="steelblue") +
  labs(
    title = "Average Departure Delay Time v/s Year of Manufacture",
    x = "Year of Manufacture",
    y = "Average Departure Delay (mins)",
  )+
  theme_bw()

From this plot, we can infer that there is no linear correlation between year of manufacture and delay time. However, the delay time appears to rise and then fall again every few years. This may be related to periodic technological advances in the industry.

Question 5: Is there a pattern to the departure delay in terms of geography?

1. Timezones of the airports

world <- map_data("world")
ggplot() +
  geom_map(data = world, map = world,aes(x=long,y=lat,map_id = region),color = "black", fill = "lightgray")+
  geom_point(data=airports,mapping=aes(x=lon,y=lat,col=tzone))+
  labs(
    title="Timezones of Airports",
    x = "Longitude",
    y = "Latitude"
  ) 

The United States is spread across six time zones. From west to east, they are Hawaii, Alaska, Pacific, Mountain, Central, and Eastern.

2. Average Departure Delay in each destination airport

The destination airports of delayed flights are concentrated in the eastern region of the USA.

The eastern seaboard includes states such as Massachusetts, New York, New Jersey, Virginia, North Carolina, South Carolina, Georgia, and Florida, which are among the most popular destinations in the USA.

data_fa %>%
  filter(dep_delay > 0, dep_delay < quantile(dep_delay, 0.99)) %>%
  group_by(faa, lat, lon) %>%
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE),Number_of_delayed_flight_in_1000 = n()/1000) %>%
  ggplot() +
  aes(x = lon, 
      y = lat, 
      color = avg_dep_delay, 
      size = Number_of_delayed_flight_in_1000
      ) +
  geom_point() +borders("state") +
  labs(
    title = "Average Departure Delays From NYC by Destination",
    x = "Longitude",
    y = "Latitude"
  ) 

Recommendations

This section contains the recommendations based on the exploratory analysis of departure delay.

Question - Is there a pattern to the departure delay in terms of time? (Month, Day of week and Hour)

Average departure delay by hour

The graph below shows that the average departure delay is considerably lower in the early hours than in the latter part of the day. According to the graph, the best time to fly to avoid delays is between 5 a.m. and 8 a.m., when the average departure delay is approximately 20 minutes. We therefore recommend that airports take advantage of this by scheduling more flights in the early hours to reduce the pressure during the rest of the day. This would require a thorough analysis and revamping of the airport flight slot scheduling system already in place.
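As an illustration of how the early-morning window could be compared with the rest of the day, the sketch below averages the delay of delayed flights inside and outside the 5 a.m. to 8 a.m. window. The data frame `toy_flights` and its values are invented; the `hour` column is assumed to hold the scheduled departure hour.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy_flights <- tibble(
  hour      = c(5, 5, 6, 7, 8, 17, 17, 18, 19, 20),
  dep_delay = c(10, 20, 18, 22, 25, 40, 50, 55, 60, 45)
)

# Average delay of delayed flights inside and outside the 05-08 window.
window_summary <- toy_flights %>%
  filter(dep_delay > 0) %>%
  mutate(window = if_else(between(hour, 5, 8), "05-08", "other")) %>%
  group_by(window) %>%
  summarise(n_delayed = n(), avg_dep_delay = mean(dep_delay))

print(window_summary)
```

The window boundaries are a choice made for this sketch; grouping by `hour` directly would give the full hourly profile shown in the graph.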

image: Data set overview

Number of delayed flights by month

The graph below shows that the summer and winter seasons, which correspond to the holiday seasons, contribute most to the number of delayed flights. The summer season tends to attract many international tourists, while winter brings many domestic travelers visiting family and friends. The high volume of passengers leads to longer queues at flight check-in counters, which often causes flight departure delays. To mitigate this, the airport should increase the available workforce and open additional counters for security checks and flight check-in.

Since the surge in passengers during the holiday seasons is a temporary spike relative to the entire year, another recommendation is to open temporary airstrips on land that can easily be returned to productive use, such as agriculture or solar energy infrastructure, once the peak passes. Along the same lines, we recommend maintaining a separate landing strip for private planes during this period so that their slots can be allocated to commercial flights.

image: Data set overview

Average departure delay vs. week

The graph below shows that Tuesday and Wednesday have a lower average departure delay than the mean value. This could be because flying mid-week typically requires time off work, which working professionals may find difficult to arrange. Consequently, the airports are relatively quieter, resulting in fewer delays. To optimize the airport’s operational cost, it’s advisable to rework the staff scheduling strategy to reflect the lower operational demands.

image: Data set overview

Question - How does weather impact flights from NYC? What is the effect of weather on departure delay?

Weather

From the investigation of the impact of different weather parameters on the departure delay, it’s evident that adverse weather conditions (low visibility, high precipitation, high relative humidity, etc.) significantly increase the delay. One recommendation to improve departure delay during adverse weather conditions is to invest in and enhance the take-off strip lighting system at the airports (EWR, JFK, LGA), helping pilots manage these situations effectively. It’s also advisable for the airport authorities to have specialist staff in place during adverse conditions to promptly remove debris and other objects from the runway. Emergency landings at these airports can also cause unexpected departure delays for flights about to take off. To mitigate this, it is advised to have proper communication systems in place to communicate seamlessly with air traffic control.

Question - What is the effect of airport and carrier on departure delay? Which airport and carrier are the best and the worst?

Analysis of Airport

Airport vs Average Departure Delay

The graph below shows that, of the three airports, LGA has the highest average departure delay and JFK has the lowest. As mentioned earlier, this is probably because LGA is a domestic airport and is smaller than EWR and JFK, which are international airports with large capacities. Smaller airports usually have limited infrastructure, facilities, and operational resources. This affects not just the flight schedule but also the number of flights that can be accommodated at the airport at the same time, which can lead to significant departure delays that then propagate to other flights. To reduce the departure delay at LGA, it’s advisable to optimize the airport ground usage to maximize the area available so that the airport can accommodate a larger number of flights. Another suggestion is to reduce the buffer time between consecutive take-offs to prevent departure delay propagation.

image: Data set overview

Airport vs Percentage Delay of Flights.

The graph below shows that EWR’s high percentage of delayed flights is another issue, because EWR has the highest number of departing flights and its delays therefore affect a larger number of people. A few recommendations to reduce departure delays at EWR are: increasing the number of security check and flight check-in counters, optimizing the logistics involved in the airport operation by investing more money into it, and ensuring proper maintenance of core elements influencing airport operation, such as runways and taxiways. The final recommendation is to invest in airport monitoring technology. With AI, existing security and observation cameras at airports can automatically detect delays in turnaround services and alert airport staff or ground crew members in real time so that they can formulate a mitigation strategy.

image: Data set overview

Analysis of Carrier

Considering the number of delayed flights, percentage delay, and average departure delay, EV was the worst-performing carrier. One recommendation to reduce the departure delay is to optimize the carrier’s security, check-in, and other operational policies to maximize the efficiency of its operation. It’s also advisable for the airline to revamp its aircraft maintenance strategy to prevent possible issues leading up to take-off. The carrier should also ensure that the pilots it employs have the experience and know-how to deal with difficult situations. One final recommendation is for the carrier to make virtual flight check-in compulsory for all passengers via an app that allows users to upload their digital certificates, eliminating paperwork and allowing smartphones to serve as their digital ID.

Question - What is the impact of plane manufacturer and structure of the aircraft on departure delays?

Effect of manufacturer and structure of the plane

From the analysis carried out earlier, the manufacturer appears to have an impact on the flight departure delay. One suggestion is for airport authorities to set stringent quality guidelines for aircraft from each manufacturer to ensure that they meet industry standards and compliance requirements, and to block aircraft from manufacturers that don’t meet the desired quality. This would help ensure the timely departure of flights without technical issues related to the manufacturer. Manufacturers can also help reduce departure delays by innovating and improving on-board flight instrumentation and sensors so that they function properly even in adverse weather conditions.

In terms of the plane’s structure, the number of engines, and the type of engine, small planes, especially light planes, aren’t the most practical choice in strong winds or heavy rain. Turbulence is more likely to affect smaller and lighter planes. Larger aircraft, like commercial jets, have multiple engines and are generally more robust and have more endurance than smaller ones. They can withstand strong winds and heavy rain more easily. The fact that Boeing builds large commercial aircraft (inferred from the earlier investigation) could be the reason why the average departure delay of Boeing is lower than that of the others, regardless of age.

Conclusion

In conclusion, the Port Authority of New York and New Jersey (PANYNJ) approached us to find possible issues and corresponding solutions related to departure delay. We explored the NYC data sets provided and formulated exploratory questions aimed at reducing the departure delay. We then analyzed these questions using exploratory and visual analytical techniques and, based on the insights derived, provided recommendations in the context of each question. A Tableau dashboard was also created to complement the analysis.

References

Aswesawit, 2022. How to Avoid Flight Delays: 7 Tips for Travelers [Online]. Available from: https://www.aswesawit.com/how-to-avoid-flight-delays/ [Accessed December 07, 2022].

Sherburnaeroclub, 2022. Flying in bad weather [Online]. Available from: https://www.sherburnaeroclub.com/blog/flying-in-bad-weather#factors-that-affect-aircraft-safety-in-bad-weather/ [Accessed December 07, 2022].

BBC, 2022. The airport tech helping to prevent delayed flights [Online]. Available from: https://www.bbc.co.uk/news/business-60228430/ [Accessed December 07, 2022].

Contribution List

Question 1: Aravind Gopakumar, Umapujitha Singh
Question 2: Umapujitha Singh
Question 3: Aravind Gopakumar
Question 4: Piyush Jain
Question 5: Umapujitha Singh
Tableau: Aravind Gopakumar, Umapujitha Singh
Report Making: Aravind Gopakumar, Piyush Jain, Umapujitha Singh